Dataset Plan

Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Generate the current preview artifact with:

.venv/bin/python -B scripts/generate_dataset.py

Default output:

data/train/objectverse_sft_preview.jsonl

This preview is mock-generated. It is not a final training dataset and should not be described as real model output.

The preview JSONL file is evidence for schema and workflow readiness only.

Curated v1 training-test artifact:

.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated

Current curated v2 artifact:

.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl

The published dataset repo now includes objectverse_sft_curated_v2.jsonl: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair, so each of the 40 x 5 = 200 object-mode pairs appears exactly once. The v1 file remains available through the repository history.
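The balance claims above (rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, not the project's own tooling: `summarize_coverage` is a hypothetical helper, and it assumes that `object_description` identifies an object uniquely, which the source does not confirm. Mode names other than "Cynical" are placeholders.

```python
from collections import Counter


def summarize_coverage(rows):
    """Count rows per mode and list any (object, mode) pair that repeats."""
    per_mode = Counter(row["mode"] for row in rows)
    # Assumption: object_description is a stable identifier for the object.
    pair_counts = Counter((row["object_description"], row["mode"]) for row in rows)
    repeated = sorted(pair for pair, count in pair_counts.items() if count > 1)
    return per_mode, repeated


# Tiny in-memory example instead of the real JSONL file.
rows = [
    {"object_description": "old white coffee mug", "mode": "Cynical"},
    {"object_description": "old white coffee mug", "mode": "Gentle"},
    {"object_description": "desk lamp", "mode": "Cynical"},
    {"object_description": "desk lamp", "mode": "Cynical"},  # deliberate repeat
]
per_mode, repeated = summarize_coverage(rows)
print(dict(per_mode))
print(repeated)
```

For the real v2 file, a clean result would be 40 rows for each of the 5 modes and an empty `repeated` list.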

Target Dataset

Target before stronger fine-tuning:

200-500 generated or curated object-persona-diary samples
at least 50 manually curated high-quality samples
no private user photos
no emails, tokens, serial numbers, or other sensitive identifiers
English-first output with optional Chinese helper text

JSONL Schema

Each line is one training candidate:

{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}

The content of the assistant message is a JSON string rather than a nested object, so the rows work with chat-style SFT tools while still preserving the structured persona and diary output.
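Because the assistant content is JSON text nested inside a JSON line, a row needs two parses to be fully checked. The sketch below illustrates that idea only; `validate_record` is a hypothetical helper, and it assumes that every key in the example above is required and that the message roles always appear in the order system, user, assistant.

```python
import json

# Assumption: all top-level keys from the schema example are mandatory.
REQUIRED_KEYS = {"id", "source", "split", "mode", "object_description",
                 "object_understanding", "messages"}
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]
    messages = record.get("messages")
    if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
        return problems + ["messages must be a list of objects"]
    if [m.get("role") for m in messages] != EXPECTED_ROLES:
        return problems + ["unexpected message roles"]

    # Second parse: the assistant content is itself JSON text.
    try:
        payload = json.loads(messages[-1].get("content", ""))
    except (json.JSONDecodeError, TypeError):
        return problems + ["assistant content is not valid JSON text"]
    if not isinstance(payload, dict):
        return problems + ["assistant content is not a JSON object"]
    for key in ("persona", "diary"):
        if key not in payload:
            problems.append(f"assistant content missing key: {key}")
    return problems


good = {
    "id": "sft-preview-0001", "source": "objectverse-diary-mock-mvp",
    "split": "preview", "mode": "Cynical",
    "object_description": "old white coffee mug on a developer desk",
    "object_understanding": {},
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": json.dumps({"persona": {}, "diary": {}})},
    ],
}
print(validate_record(json.dumps(good)))  # []
```

An empty list means the line matched this sketch's assumptions; it says nothing about content quality.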

Generation Workflow

Preview:

.venv/bin/python -B scripts/generate_dataset.py --count 60

Full candidate pool later:

.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
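Holding out 10-15% for evaluation is easiest to reproduce when the split does not depend on row order or a random seed. One way to do that is sketched below by ranking rows on a hash of their `id`. This is an illustration under stated assumptions, not the split logic in scripts/finetune_lora.py: `split_rows` and the `sft-curated-NNNN` ids are hypothetical.

```python
import hashlib


def split_rows(rows, eval_ratio=0.1):
    """Deterministically split rows into (train, held_out) by hashing each id."""
    ranked = sorted(
        rows,
        key=lambda row: hashlib.sha256(row["id"].encode("utf-8")).hexdigest(),
    )
    # Keep at least one eval row once there are two or more rows in total.
    eval_count = max(1, round(len(ranked) * eval_ratio)) if len(ranked) >= 2 else 0
    return ranked[eval_count:], ranked[:eval_count]


# Hypothetical ids; only the count matters for this example.
rows = [{"id": f"sft-curated-{i:04d}"} for i in range(1, 201)]
train, held_out = split_rows(rows, eval_ratio=0.1)
print(len(train), len(held_out))  # 180 20
```

A split stratified by mode may be preferable for a small dataset, so that every personality mode appears in the evaluation rows.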

Space VLM validation traces under data/traces/space-vlm/ are failure evidence because they include vision-fallback-to-mock. Do not mix them into curated training data or describe them as successful real VLM outputs.

Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

pip install -r requirements-training.txt

Validate the local JSONL shape without Modal auth or GPU usage:

.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20

Executed v2 training command:

modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora.

Current v2 run summary:

run name: objectverse-diary-qwen15b-lora-v2
dataset: data/train/objectverse_sft_curated_v2.jsonl
dataset repo path: objectverse_sft_curated_v2.jsonl
records: 200 total, 180 train, 20 eval
base model: Qwen/Qwen2.5-1.5B-Instruct
max steps: 120
learning rate: 1e-4
max sequence length: 1536
LoRA rank / alpha / dropout: 32 / 64 / 0.05
effective batch size: 8
assistant-output-only loss: enabled
train loss: 0.3240
eval loss: 0.0162
train runtime: 140.3364s
epoch: 5.2222
local adapter export: exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/ (git-ignored)
model repo: https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora
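The effective batch size and epoch figures in the summary follow from the other settings. The arithmetic below is one accounting that reproduces the reported numbers; it assumes a single GPU and that the last, partially filled gradient-accumulation group of each epoch still counts as one optimizer step. Neither assumption is confirmed by the training script, and both function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Rows contributing to one optimizer step (single GPU assumed by default)."""
    return per_device_batch * grad_accum * num_devices


def epochs_at_max_steps(train_rows, per_device_batch, grad_accum, max_steps):
    """Epochs completed after max_steps optimizer steps.

    Assumption: a trailing partial accumulation group at the end of an
    epoch still triggers an optimizer step.
    """
    batches_per_epoch = math.ceil(train_rows / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    full_epochs, extra_steps = divmod(max_steps, steps_per_epoch)
    return full_epochs + extra_steps * grad_accum / batches_per_epoch


print(effective_batch_size(2, 4))                     # 8
print(round(epochs_at_max_steps(180, 2, 4, 120), 4))  # 5.2222
```

With 180 train rows this gives 90 batches and 23 optimizer steps per epoch, so 120 steps cover 5 full epochs plus 5 more steps (20 of 90 batches).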

An additional v2 scaffold validation run, objectverse-diary-qwen15b-lora-v2-curated50-retry1, completed on Modal with the existing 50-row curated dataset (45 train rows, 5 eval rows). It used assistant-output-only loss, max_steps=120, learning_rate=1e-4, max_seq_length=1536, LoRA r=32, alpha=64, and effective batch size 8. Metrics: train_loss=0.2551, eval_loss=0.0093, train_runtime=146.5398s, epoch=20.0. The adapter was downloaded to the git-ignored local exports/ directory and has not been published to Hugging Face Hub.

Default training scaffold settings:

base model: Qwen/Qwen2.5-1.5B-Instruct
LoRA adapter target: persona and diary JSON output
default loss: assistant-output-only labels, with prompt tokens masked
default eval split: 10% when the dataset has at least two rows
GPU: Modal A10G
output: Modal Volume artifacts, not committed files
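Assistant-output-only loss with masked prompt tokens usually means the label for every prompt position is set to an ignore value, so only assistant tokens contribute to the loss. The sketch below shows that idea in plain Python; it is not the scaffold's implementation. `build_labels` is a hypothetical helper and the token ids are made up. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, assistant_ids):
    """Concatenate prompt and assistant tokens, masking the prompt in labels."""
    input_ids = list(prompt_ids) + list(assistant_ids)
    # Prompt positions are ignored by the loss; assistant positions keep their ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(assistant_ids)
    return input_ids, labels


# Made-up token ids: three prompt tokens, four assistant tokens.
input_ids, labels = build_labels([101, 102, 103], [7, 8, 9, 2])
print(input_ids)  # [101, 102, 103, 7, 8, 9, 2]
print(labels)     # [-100, -100, -100, 7, 8, 9, 2]
```

Masking the prompt keeps the adapter from spending capacity on reproducing the system and user text, which is identical or near-identical across rows.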

The current objectverse_sft_preview.jsonl file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, API tokens (including Hugging Face tokens), or private datasets in the repo.

The published objectverse_sft_curated_v2.jsonl dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

Curation Checklist

Persona stays consistent with the object.
Diary is short, vivid, and English-first.
Chinese helper text is secondary.
Output reads like an entry in a strange object archive.
No real person, email, token, address, credit code, or serial number remains.
No commercial cloud AI model was used to create the sample.
JSON parses cleanly.
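Part of the checklist can be pre-screened automatically before manual review. The sketch below flags lines that look like they contain an email address or a Hugging Face style token. The patterns are illustrative assumptions, not an exhaustive privacy filter, and `find_sensitive` is a hypothetical helper; a clean result does not replace reading the sample.

```python
import re

# Illustrative patterns only; real identifiers can take many other forms.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Hugging Face user access tokens start with "hf_"; the length bound is a guess.
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
}


def find_sensitive(text):
    """Return the sorted names of all patterns that match the given text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


print(find_sensitive("The mug sighs at another stand-up."))  # []
print(find_sensitive("Write to mug@example.com"))            # ['email']
```

Running this over each raw JSONL line covers both the prompt and the assistant content in one pass.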

Publishing Notes

When publishing to Hugging Face Datasets:

create a dataset card
document that mock preview rows are synthetic
separate curated rows from raw candidates
include license and privacy notes
keep private images out of the repo

Curated v2 was published with:

.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"