# Dataset Plan

## Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Generate the current preview artifact with:

```bash
.venv/bin/python -B scripts/generate_dataset.py
```

Default output:

```text
data/train/objectverse_sft_preview.jsonl
```

This preview is mock-generated. It is not a final training dataset and should not be described as real model output.
The preview JSONL file is evidence for schema and workflow readiness only.

Generate the curated v1 training-test artifact with:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl
```

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```

Generate the current curated v2 artifact with:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl
```

The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file is preserved in the repository history.
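The balance claims for the v2 file (40 rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, not part of the repository's scripts. It assumes each row carries the `mode` and `object_description` fields shown in the schema and that `object_description` is a good enough stand-in for object identity; `check_mode_balance` is a hypothetical helper name.

```python
from collections import Counter


def check_mode_balance(rows, rows_per_mode=40):
    """Return a list of problems; an empty list means both claims hold."""
    problems = []

    # Claim 1: every mode has exactly `rows_per_mode` rows.
    mode_counts = Counter(row["mode"] for row in rows)
    for mode, count in sorted(mode_counts.items()):
        if count != rows_per_mode:
            problems.append(
                f"mode {mode!r} has {count} rows, expected {rows_per_mode}"
            )

    # Claim 2: no (object, mode) pair repeats.
    # Assumption: object_description identifies the object.
    pair_counts = Counter(
        (row["object_description"], row["mode"]) for row in rows
    )
    for (obj, mode), count in sorted(pair_counts.items()):
        if count > 1:
            problems.append(f"pair ({obj!r}, {mode!r}) appears {count} times")

    return problems
```

To use it, load the JSONL file into a list of dicts with `json.loads` on each non-empty line and pass that list in.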
## Target Dataset

Target before stronger fine-tuning:

- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text
## JSONL Schema

Each line is one training candidate:

```json
{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}
```

The assistant message's `content` field is JSON serialized as a string, so each row works with chat-style SFT tools while preserving the structured output.
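The schema above can be checked one line at a time. This is a minimal sketch and not the validation that `scripts/finetune_lora.py --dry-run` performs. `validate_record` is a hypothetical helper, and the assumption that every row has exactly one system, one user, and one assistant message (in that order) comes only from the example row.

```python
import json

REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}
# Assumption: message order follows the example row.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line):
    """Parse one JSONL line and return a list of schema problems."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - record.keys())]

    messages = record.get("messages")
    if not isinstance(messages, list):
        return problems + ["messages is not a list"]

    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        return problems + [f"unexpected message roles: {roles}"]

    # The assistant content is JSON text, so it needs a second parse.
    try:
        payload = json.loads(messages[-1]["content"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return problems + ["assistant content is not valid JSON text"]
    if not isinstance(payload, dict):
        return problems + ["assistant content is not a JSON object"]

    for key in ("persona", "diary"):
        if key not in payload:
            problems.append(f"assistant content is missing {key!r}")
    return problems
```

An empty list means the line matches the documented shape; anything else names the first structural problem found along with any missing top-level keys.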
## Generation Workflow

Preview:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```

Full candidate pool later:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
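Holding out a fixed share of curated rows is easier to audit when the split is reproducible. The sketch below shows one way to do that by ordering rows on a hash of their `id`; it is an illustration, not the split logic used by the training scaffold. `split_rows` is a hypothetical helper, and the rule that very small datasets still get at least one eval row is an assumption.

```python
import hashlib


def split_rows(rows, eval_ratio=0.1):
    """Return (train_rows, eval_rows) using a stable, id-based ordering."""
    # Hashing the id gives an order that does not depend on file order
    # or on a random seed.
    ordered = sorted(
        rows,
        key=lambda row: hashlib.sha256(row["id"].encode("utf-8")).hexdigest(),
    )
    if len(ordered) < 2:
        return ordered, []  # nothing to hold out from a single row

    eval_count = round(len(ordered) * eval_ratio)
    # Keep at least one row on each side of the split.
    eval_count = min(max(eval_count, 1), len(ordered) - 1)
    return ordered[eval_count:], ordered[:eval_count]
```

With 200 rows and `eval_ratio=0.1`, this returns 180 training rows and 20 evaluation rows, and the same rows land in each side no matter how the input list is ordered.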

Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
## Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

```bash
pip install -r requirements-training.txt
```

Validate the local JSONL shape without Modal auth or GPU usage:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test
```

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Executed v2 training command:

```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.

Current v2 run summary:

- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (ignored, not committed)
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`

An additional v2 scaffold validation run, `objectverse-diary-qwen15b-lora-v2-curated50-retry1`, completed on Modal with the existing 50-row curated dataset. It used assistant-output-only loss, 45 train rows, 5 eval rows, `max_steps=120`, `learning_rate=1e-4`, `max_seq_length=1536`, LoRA `r=32`, `alpha=64`, and effective batch size 8. Metrics: `train_loss=0.2551`, `eval_loss=0.0093`, `train_runtime=146.5398s`, `epoch=20.0`. The adapter was downloaded to the ignored local `exports/` directory and has not been published to Hugging Face Hub.
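The reported `epoch` values follow from the step budget, the batch settings, and the number of training rows. The arithmetic below is a sketch under two assumptions that the source does not confirm: training ran on a single GPU, and the trainer counts the trailing partial accumulation group at the end of each epoch as one optimizer step. Under those assumptions it reproduces both reported values. `estimated_epochs` is a hypothetical helper.

```python
import math


def estimated_epochs(train_rows, per_device_batch, grad_accum, max_steps):
    """Estimate the epoch counter reached after `max_steps` optimizer steps."""
    # Micro-batches in one pass over the data (the last one may be short).
    batches_per_epoch = math.ceil(train_rows / per_device_batch)
    # Optimizer steps per epoch; a trailing partial group still counts as a step.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)

    full_epochs, extra_steps = divmod(max_steps, steps_per_epoch)
    # Micro-batches consumed in the final, unfinished epoch.
    extra_batches = min(extra_steps * grad_accum, batches_per_epoch)
    return full_epochs + extra_batches / batches_per_epoch


effective_batch = 2 * 4  # per-device batch size x gradient accumulation steps = 8

v2_epochs = estimated_epochs(180, 2, 4, 120)       # 90 batches, 23 steps per epoch
curated50_epochs = estimated_epochs(45, 2, 4, 120)  # 23 batches, 6 steps per epoch
print(round(v2_epochs, 4), round(curated50_epochs, 4))  # 5.2222 20.0
```

For the v2 run, five full epochs take 115 steps, and the remaining 5 steps cover 20 of 90 micro-batches, which gives 5 + 20/90. For the 50-row run, 120 steps divide evenly into 20 epochs of 6 steps, so the same step budget means each row is seen roughly four times as often.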

Default training scaffold settings:

- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files
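
The assistant-output-only loss listed above usually comes down to how the label sequence is built. The sketch below shows the general technique on plain Python lists and is not taken from `scripts/finetune_lora.py`. It assumes the prompt (system and user turns) occupies the first `prompt_length` tokens, and it uses `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. `mask_prompt_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, hiding the prompt tokens from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be between 0 and len(input_ids)")
    # Prompt positions get IGNORE_INDEX; assistant positions keep their token id.
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Illustrative token ids only: three prompt tokens, two assistant tokens.
labels = mask_prompt_labels([11, 12, 13, 14, 15], prompt_length=3)
print(labels)  # [-100, -100, -100, 14, 15]
```

With labels built this way, only the persona and diary JSON tokens contribute to the loss, so the adapter is not rewarded for reproducing the system or user prompt.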

The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.
## Curation Checklist

- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output reads like an entry from a strange object archive.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.
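
Two of the checklist items (JSON parsing and obvious identifiers) can be pre-screened automatically before manual review. The sketch below is a rough first pass and does not replace human curation. The regular expressions are illustrative assumptions: one matches email-like strings, the other matches strings that look like Hugging Face access tokens, which start with `hf_`. `checklist_flags` is a hypothetical helper that expects the assistant `content` string.

```python
import json
import re

# Illustrative patterns only; they will miss many identifier formats.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")


def checklist_flags(assistant_content):
    """Return checklist items that this assistant content appears to violate."""
    flags = []
    try:
        json.loads(assistant_content)
    except json.JSONDecodeError:
        flags.append("assistant content does not parse as JSON")
    if EMAIL_PATTERN.search(assistant_content):
        flags.append("possible email address")
    if TOKEN_PATTERN.search(assistant_content):
        flags.append("possible access token")
    return flags
```

Rows with no flags still need a human pass for persona consistency, tone, and the remaining privacy items, since names, addresses, and serial numbers are not covered by these patterns.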
## Publishing Notes

When publishing to Hugging Face Datasets:

- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo

Curated v2 was published with:

```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"
```