Dataset Plan

Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Command that generates the current preview artifact:

.venv/bin/python -B scripts/generate_dataset.py

Default output:

data/train/objectverse_sft_preview.jsonl

This preview is mock-generated. It is not a final training dataset and should not be described as real model output.

The preview JSONL file is evidence for schema and workflow readiness only.

Command that generates the curated v1 training-test artifact:

.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated

Command that generates the current curated v2 artifact:

.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl

The published dataset repo now includes objectverse_sft_curated_v2.jsonl: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file is preserved in the repository history.
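The balance claims above (40 rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, not the project's own validation code. The function name `check_balance` is hypothetical, and it assumes that `object_description` identifies the object in each row.

```python
import json
from collections import Counter


def check_balance(lines, expected_per_mode=40):
    """Return (mode_counts, repeated_pairs, unbalanced_modes) for JSONL lines."""
    mode_counts = Counter()
    pair_counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        mode_counts[row["mode"]] += 1
        # Assumption: object_description is the key that identifies the object.
        pair_counts[(row["object_description"], row["mode"])] += 1
    # Pairs that occur more than once violate the "no repeated pair" claim.
    repeated = sorted(pair for pair, n in pair_counts.items() if n > 1)
    # Modes whose row count differs from the expected value.
    unbalanced = {m: n for m, n in mode_counts.items() if n != expected_per_mode}
    return mode_counts, repeated, unbalanced
```

To run it against the curated file, pass an open file handle, for example `check_balance(open("data/train/objectverse_sft_curated_v2.jsonl", encoding="utf-8"))`, and confirm that the last two return values are empty.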

Target Dataset

Target before stronger fine-tuning:

  • 200-500 generated or curated object-persona-diary samples
  • at least 50 manually curated high-quality samples
  • no private user photos
  • no emails, tokens, serial numbers, or other sensitive identifiers
  • English-first output with optional Chinese helper text

JSONL Schema

Each line is one training candidate:

{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}

The content field of the assistant message holds JSON text (a serialized string, not a nested object), so each record works with chat-style SFT tools while preserving the structured output.
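Because the assistant content is JSON text inside a JSON record, a validator has to decode twice. The sketch below illustrates that check against the schema shown above. It is not the logic of the repository's scripts: the function name `validate_record` is hypothetical, and requiring exactly one assistant message with `persona` and `diary` keys is an assumption drawn from the example record.

```python
import json

# Top-level keys taken from the schema example above.
REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}


def validate_record(line):
    """Return a list of problems for one JSONL line; an empty list means it passed."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc}"]
    if not isinstance(row, dict):
        return ["record is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    messages = row.get("messages")
    if not isinstance(messages, list):
        return problems + ["messages is not a list"]
    assistant = [m for m in messages
                 if isinstance(m, dict) and m.get("role") == "assistant"]
    if len(assistant) != 1:
        return problems + ["expected exactly one assistant message"]

    # The assistant content is JSON text, so it needs a second decode.
    try:
        payload = json.loads(assistant[0].get("content", ""))
    except (TypeError, json.JSONDecodeError):
        return problems + ["assistant content is not valid JSON text"]
    if not isinstance(payload, dict):
        return problems + ["assistant content is not a JSON object"]
    for key in ("persona", "diary"):
        if key not in payload:
            problems.append(f"assistant content lacks '{key}'")
    return problems
```

Running `validate_record` over every line of a JSONL file and collecting the non-empty results gives a quick report before any training run.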

Generation Workflow

Preview:

.venv/bin/python -B scripts/generate_dataset.py --count 60

Full candidate pool later:

.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
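Holding out 10-15% of the curated rows can be done with a seeded shuffle so the split is reproducible. This is a minimal sketch under stated assumptions: the function name `split_rows`, the default seed, and the rounding rule are hypothetical and are not taken from scripts/finetune_lora.py.

```python
import random


def split_rows(rows, eval_ratio=0.1, seed=0):
    """Split rows into (train, eval) with a seeded, reproducible shuffle."""
    rows = list(rows)
    if len(rows) < 2 or eval_ratio <= 0:
        return rows, []  # too small to hold anything out
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    # Keep at least one eval row and at least one train row.
    eval_count = min(len(rows) - 1, max(1, round(len(rows) * eval_ratio)))
    eval_idx = set(order[:eval_count])
    # Preserve the original row order inside each split.
    train = [r for i, r in enumerate(rows) if i not in eval_idx]
    held_out = [r for i, r in enumerate(rows) if i in eval_idx]
    return train, held_out
```

With 200 rows and `eval_ratio=0.1` this yields 180 train rows and 20 eval rows. Splitting by object rather than by row would be a stricter alternative if the same object should never appear in both sets.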

Space VLM validation traces under data/traces/space-vlm/ are failure evidence because they include vision-fallback-to-mock. Do not mix them into curated training data or describe them as successful real VLM outputs.

Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

pip install -r requirements-training.txt

Validate the local JSONL shape without Modal auth or GPU usage:

.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20

Executed v2 training command:

modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora.

Current v2 run summary:

  • run name: objectverse-diary-qwen15b-lora-v2
  • dataset: data/train/objectverse_sft_curated_v2.jsonl
  • dataset repo path: objectverse_sft_curated_v2.jsonl
  • records: 200 total, 180 train, 20 eval
  • base model: Qwen/Qwen2.5-1.5B-Instruct
  • max steps: 120
  • learning rate: 1e-4
  • max sequence length: 1536
  • LoRA rank / alpha / dropout: 32 / 64 / 0.05
  • effective batch size: 8
  • assistant-output-only loss: enabled
  • train loss: 0.3240
  • eval loss: 0.0162
  • train runtime: 140.3364s
  • epoch: 5.2222
  • local adapter export: exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/ (ignored, not committed)
  • model repo: https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora
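The effective batch size and the epoch figure in the summary follow from the listed settings. The sketch below shows one accounting that reproduces both numbers. It assumes a single GPU and assumes the trainer takes an optimizer step on the trailing partial accumulation window at the end of each epoch. Neither assumption is stated in this document, and the function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Rows contributing to one optimizer step (num_devices=1 is an assumption)."""
    return per_device_batch * grad_accum * num_devices


def epochs_after_steps(train_rows, per_device_batch, grad_accum, max_steps):
    """Fractional epochs reached after max_steps optimizer steps."""
    # Micro-batches per epoch; the last one may be smaller than the rest.
    micro_batches = math.ceil(train_rows / per_device_batch)
    # Assumption: a trailing partial accumulation window still counts as a step.
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    full_epochs, extra_steps = divmod(max_steps, steps_per_epoch)
    # Rows consumed by the extra steps in the final, incomplete epoch.
    extra_rows = min(train_rows, extra_steps * grad_accum * per_device_batch)
    return full_epochs + extra_rows / train_rows


print(effective_batch_size(2, 4))                   # 8
print(round(epochs_after_steps(180, 2, 4, 120), 4)) # 5.2222
```

Under the same assumptions, 45 train rows with the same settings give 6 steps per epoch and exactly 20.0 epochs after 120 steps.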

Additional v2 scaffold validation run: objectverse-diary-qwen15b-lora-v2-curated50-retry1 completed on Modal with the existing 50-row curated dataset.

  • records: 50 total, 45 train, 5 eval
  • assistant-output-only loss: enabled
  • max steps: 120
  • learning rate: 1e-4
  • max sequence length: 1536
  • LoRA rank / alpha: 32 / 64
  • effective batch size: 8
  • train loss: 0.2551
  • eval loss: 0.0093
  • train runtime: 146.5398s
  • epoch: 20.0

The adapter was downloaded to the ignored local exports/ directory; it has not been published to Hugging Face Hub.

Default training scaffold settings:

  • base model: Qwen/Qwen2.5-1.5B-Instruct
  • LoRA adapter target: persona and diary JSON output
  • default loss: assistant-output-only labels, with prompt tokens masked
  • default eval split: 10% when the dataset has at least two rows
  • GPU: Modal A10G
  • output: Modal Volume artifacts, not committed files
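The "assistant-output-only labels, with prompt tokens masked" setting can be illustrated with toy token IDs. This is a sketch of the general technique, not the scaffold's code. The function name `build_labels` is hypothetical, and the use of -100 follows the default `ignore_index` of PyTorch's cross-entropy loss, which is assumed to be the convention the training stack relies on.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, assistant_ids):
    """Concatenate prompt and assistant tokens; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(assistant_ids)
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss;
    # assistant positions keep their token IDs as training targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(assistant_ids)
    return input_ids, labels


# Toy IDs: three prompt tokens (system + user) and two assistant tokens.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(labels)  # [-100, -100, -100, 21, 22]
```

The model still sees the full prompt as context, but only the persona and diary JSON tokens are scored.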

The current objectverse_sft_preview.jsonl file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

The published objectverse_sft_curated_v2.jsonl dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

Curation Checklist

  • Persona stays consistent with the object.
  • Diary is short, vivid, and English-first.
  • Chinese helper text is secondary.
  • Output reads like an entry in a strange archive of objects.
  • No real person, email, token, address, credit code, or serial number remains.
  • No commercial cloud AI model was used to create the sample.
  • JSON parses cleanly.
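Part of this checklist can be pre-screened automatically before manual review. The sketch below flags email addresses and strings that look like Hugging Face access tokens. The patterns are illustrative assumptions, not an exhaustive or official list, and the function name `find_sensitive` is hypothetical. A clean result does not replace human review.

```python
import re

# Illustrative patterns only; extend them for other identifier types.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Assumption: tokens of interest start with "hf_" followed by 20+ alphanumerics.
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
}


def find_sensitive(text):
    """Return (pattern_name, matched_text) pairs found in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(text))
    return hits
```

Applying `find_sensitive` to each raw JSONL line covers both the top-level fields and the nested assistant content, since the nested JSON is stored as text in the same line.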

Publishing Notes

When publishing to Hugging Face Datasets:

  • create a dataset card
  • document that mock preview rows are synthetic
  • separate curated rows from raw candidates
  • include license and privacy notes
  • keep private images out of the repo

Curated v2 was published with:

.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"