ObjectverseDiary / docs /DATASET.md

qqyule

Deploy latest Objectverse Diary from fa09aac

dd6cefc verified 3 days ago


	# Dataset Plan

	## Status

	The project now has a deterministic SFT preview generator for local planning and schema validation.

	Current preview artifact:

	```bash
	.venv/bin/python -B scripts/generate_dataset.py
	```

	Default output:

	```text
	data/train/objectverse_sft_preview.jsonl
	```

	This preview is mock-generated. It is not a final training dataset and should not be described as real model output.

	The preview JSONL file is evidence for schema and workflow readiness only.

	Curated v1 training-test artifact:

	```bash
	.venv/bin/python -B scripts/prepare_curated_dataset.py \
	--count 50 \
	--output data/train/objectverse_sft_curated.jsonl
	```

	This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

	Published synthetic curated dataset:

	```text
	https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
	```

	Current curated v2 artifact:

	```bash
	.venv/bin/python -B scripts/prepare_curated_dataset.py \
	--version v2 \
	--count 200 \
	--output data/train/objectverse_sft_curated_v2.jsonl
	```

	The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file remains preserved through repository history.
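The coverage claims above (200 rows, 40 objects, 5 modes, 40 rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, not the project's own validator: `load_jsonl` and `check_coverage` are hypothetical helper names, and it assumes that `object_description` uniquely identifies an object.

```python
import json
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_coverage(rows, expected_objects=40, expected_modes=5, expected_per_mode=40):
    """Return a list of coverage problems; an empty list means the rows pass."""
    problems = []
    # Assumption: object_description is the object identity.
    pairs = Counter((row["object_description"], row["mode"]) for row in rows)
    per_mode = Counter(row["mode"] for row in rows)
    objects = {row["object_description"] for row in rows}

    repeated = sorted(pair for pair, count in pairs.items() if count > 1)
    if repeated:
        problems.append(f"repeated object-mode pairs: {repeated}")
    if len(objects) != expected_objects:
        problems.append(f"expected {expected_objects} objects, found {len(objects)}")
    if len(per_mode) != expected_modes:
        problems.append(f"expected {expected_modes} modes, found {len(per_mode)}")
    for mode, count in sorted(per_mode.items()):
        if count != expected_per_mode:
            problems.append(f"mode {mode!r} has {count} rows, expected {expected_per_mode}")
    return problems
```

A possible use is `check_coverage(load_jsonl("data/train/objectverse_sft_curated_v2.jsonl"))`, treating any non-empty result as a reason to stop before training.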

	## Target Dataset

	Target before stronger fine-tuning:

	- 200-500 generated or curated object-persona-diary samples
	- at least 50 manually curated high-quality samples
	- no private user photos
	- no emails, tokens, serial numbers, or other sensitive identifiers
	- English-first output with optional Chinese helper text

	## JSONL Schema

	Each line is one training candidate:

	```json
	{
	  "id": "sft-preview-0001",
	  "source": "objectverse-diary-mock-mvp",
	  "split": "preview",
	  "mode": "Cynical",
	  "object_description": "old white coffee mug on a developer desk",
	  "object_understanding": {},
	  "messages": [
	    {"role": "system", "content": "..."},
	    {"role": "user", "content": "..."},
	    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
	  ]
	}
	```

	The `assistant.content` field is JSON text so it can be used with chat-style SFT tools while preserving structured outputs.
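Because the assistant content is JSON text nested inside a JSON record, a reader has to decode twice. The sketch below illustrates that double decode along with a basic shape check. `validate_record` is a hypothetical helper, and it assumes every record carries exactly the system, user, assistant message order shown in the example.

```python
import json

REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}
EXPECTED_ROLES = ["system", "user", "assistant"]  # assumption taken from the example


def validate_record(line):
    """Parse one JSONL line and return the decoded assistant payload.

    Raises ValueError when the record does not match the documented shape.
    """
    record = json.loads(line)  # first decode: the JSONL record itself
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    roles = [message["role"] for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")

    # Second decode: the assistant content is itself JSON text.
    payload = json.loads(record["messages"][-1]["content"])
    for key in ("persona", "diary"):
        if key not in payload:
            raise ValueError(f"assistant payload lacks {key!r}")
    return payload
```

Malformed JSON at either level also surfaces as `ValueError`, since `json.JSONDecodeError` is a subclass of it.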

	## Generation Workflow

	Preview:

	```bash
	.venv/bin/python -B scripts/generate_dataset.py --count 60
	```

	Full candidate pool later:

	```bash
	.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
	```

	Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
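One way to reserve the 10-15% evaluation share reproducibly is to order rows by a hash of their `id` and hold out the first slice. This is a sketch under that assumption; `split_train_eval` is a hypothetical helper and may differ from how `scripts/finetune_lora.py` splits data.

```python
import hashlib


def split_train_eval(rows, eval_ratio=0.1):
    """Deterministically hold out roughly `eval_ratio` of rows for evaluation.

    Rows are ordered by a hash of their `id`, so the split is stable across
    runs and does not depend on the order of lines in the file.
    """
    if len(rows) < 2:
        return list(rows), []
    ordered = sorted(
        rows,
        key=lambda row: hashlib.sha256(row["id"].encode("utf-8")).hexdigest(),
    )
    eval_count = max(1, round(len(rows) * eval_ratio))
    return ordered[eval_count:], ordered[:eval_count]
```

Hashing the `id` rather than shuffling with a seed keeps a row on the same side of the split even when other rows are added or removed later, as long as the total count yields the same cut-off.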

	Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
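A simple guard can keep such traces out of a training pool. The trace file format is not documented here, so this sketch only assumes each trace is JSON-serializable and searches its serialized text for the marker. `is_fallback_trace` and `drop_fallback_traces` are hypothetical names.

```python
import json

FALLBACK_MARKER = "vision-fallback-to-mock"


def is_fallback_trace(trace):
    """True when the fallback marker appears anywhere in the serialized trace."""
    return FALLBACK_MARKER in json.dumps(trace, ensure_ascii=False)


def drop_fallback_traces(traces):
    """Keep only traces that never fell back to the mock vision path."""
    return [trace for trace in traces if not is_fallback_trace(trace)]
```

Searching the serialized text avoids guessing which field holds the marker, at the cost of also matching traces that merely mention it in free text.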

	## Modal LoRA Training Scaffold

	The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

	Install the local Modal CLI dependency separately:

	```bash
	pip install -r requirements-training.txt
	```

	Validate the local JSONL shape without Modal auth or GPU usage:

	```bash
	.venv/bin/python -B scripts/finetune_lora.py \
	--dry-run \
	--dataset data/train/objectverse_sft_curated.jsonl \
	--run-name objectverse-diary-qwen15b-curated-test
	```

	The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

	```bash
	.venv/bin/python -B scripts/finetune_lora.py \
	--dry-run \
	--dataset data/train/objectverse_sft_curated_v2.jsonl \
	--run-name objectverse-diary-qwen15b-lora-v2 \
	--max-steps 120 \
	--learning-rate 1e-4 \
	--max-seq-length 1536 \
	--lora-r 32 \
	--lora-alpha 64 \
	--per-device-train-batch-size 2 \
	--gradient-accumulation-steps 4 \
	--eval-ratio 0.1 \
	--eval-steps 20
	```

	Executed v2 training command:

	```bash
	modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
	--dataset data/train/objectverse_sft_curated_v2.jsonl \
	--run-name objectverse-diary-qwen15b-lora-v2 \
	--max-steps 120 \
	--learning-rate 1e-4 \
	--max-seq-length 1536 \
	--lora-r 32 \
	--lora-alpha 64 \
	--per-device-train-batch-size 2 \
	--gradient-accumulation-steps 4 \
	--eval-ratio 0.1 \
	--eval-steps 20
	```

	Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.

	Current v2 run summary:

	- run name: `objectverse-diary-qwen15b-lora-v2`
	- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
	- dataset repo path: `objectverse_sft_curated_v2.jsonl`
	- records: 200 total, 180 train, 20 eval
	- base model: `Qwen/Qwen2.5-1.5B-Instruct`
	- max steps: 120
	- learning rate: `1e-4`
	- max sequence length: 1536
	- LoRA rank / alpha / dropout: 32 / 64 / 0.05
	- effective batch size: 8
	- assistant-output-only loss: enabled
	- train loss: 0.3240
	- eval loss: 0.0162
	- train runtime: 140.3364s
	- epoch: 5.2222
	- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (git-ignored, not committed)
	- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`
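The effective batch size in the summary follows from the per-device batch size and gradient accumulation steps used in the training command. A small worked sketch, assuming a single GPU (the function names are illustrative, not part of the scaffold):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def sequences_seen(max_steps, batch_size):
    """Upper bound on training sequences consumed over the whole run."""
    return max_steps * batch_size


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)  # 2 * 4 * 1 = 8
total = sequences_seen(max_steps=120, batch_size=batch)               # 120 * 8 = 960
```

The second figure is an upper bound because the last batch of an epoch can be smaller than the others.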

	Additional v2 scaffold validation run, completed on Modal:

	- run name: `objectverse-diary-qwen15b-lora-v2-curated50-retry1`
	- dataset: the existing 50-row curated dataset
	- records: 45 train, 5 eval
	- max steps: 120
	- learning rate: `1e-4`
	- max sequence length: 1536
	- LoRA rank / alpha: 32 / 64
	- effective batch size: 8
	- assistant-output-only loss: enabled
	- train loss: 0.2551
	- eval loss: 0.0093
	- train runtime: 146.5398s
	- epoch: 20.0
	- adapter: downloaded to the ignored local `exports/` directory; not published to Hugging Face Hub

	Default training scaffold settings:

	- base model: `Qwen/Qwen2.5-1.5B-Instruct`
	- LoRA adapter target: persona and diary JSON output
	- default loss: assistant-output-only labels, with prompt tokens masked
	- default eval split: 10% when the dataset has at least two rows
	- GPU: Modal `A10G`
	- output: Modal Volume artifacts, not committed files
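Assistant-output-only loss is commonly implemented by copying the input token ids into the labels and overwriting the prompt positions with `-100`, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows that masking on plain lists. It illustrates the general technique, not necessarily how `scripts/finetune_lora.py` does it, and `mask_prompt_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy `input_ids` into labels, masking the first `prompt_length` tokens.

    Only the tokens after the prompt (the assistant output) keep their ids
    and therefore contribute to the loss.
    """
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])
```

For example, with five token ids and a three-token prompt, only the last two positions remain as training targets.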

	The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

	The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

	## Curation Checklist

	- Persona stays consistent with the object.
	- Diary is short, vivid, and English-first.
	- Chinese helper text is secondary.
	- Output has a strange object archive feeling.
	- No real person, email, token, address, credit code, or serial number remains.
	- No commercial cloud AI model was used to create the sample.
	- JSON parses cleanly.
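Two checklist items lend themselves to an automated first pass: JSON validity and obvious sensitive strings. The sketch below is illustrative only. The patterns are assumptions (a generic email shape and the `hf_` prefix used by Hugging Face access tokens), `audit_line` is a hypothetical helper, and items such as addresses, serial numbers, or persona consistency still need manual review.

```python
import json
import re

# Illustrative patterns only; they will not catch every sensitive string.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
}


def audit_line(line):
    """Return checklist violations found in one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["line is not valid JSON"]
    # Re-serialize so nested fields and JSON-in-JSON content are all searched.
    text = json.dumps(record, ensure_ascii=False)
    return [
        f"possible {name}"
        for name, pattern in PATTERNS.items()
        if pattern.search(text)
    ]
```

An empty result means only that these two checks passed, not that the row is ready to publish.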

	## Publishing Notes

	When publishing to Hugging Face Datasets:

	- create a dataset card
	- document that mock preview rows are synthetic
	- separate curated rows from raw candidates
	- include license and privacy notes
	- keep private images out of the repo

	Curated v2 was published with:

	```bash
	.venv/bin/python -B scripts/publish_hf_dataset.py \
	--dataset-file data/train/objectverse_sft_curated_v2.jsonl \
	--repo-id qqyule/objectverse-diary-sft-curated \
	--path-in-repo objectverse_sft_curated_v2.jsonl \
	--commit-message "Upload Objectverse Diary curated v2 dataset"
	```
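The flags in that command map naturally onto `HfApi.upload_file` from the `huggingface_hub` package. The internals of `scripts/publish_hf_dataset.py` are not shown here, so the following is only a sketch of what an equivalent upload could look like. `build_upload_kwargs` and `publish` are hypothetical names, and authentication is assumed to come from a cached login or the `HF_TOKEN` environment variable rather than anything stored in the repo.

```python
def build_upload_kwargs(dataset_file, repo_id, path_in_repo, commit_message):
    """Translate the CLI flags into keyword arguments for HfApi.upload_file."""
    return {
        "path_or_fileobj": dataset_file,
        "path_in_repo": path_in_repo,
        "repo_id": repo_id,
        "repo_type": "dataset",  # upload to a dataset repo, not a model repo
        "commit_message": commit_message,
    }


def publish(**kwargs):
    """Upload one file; needs network access and valid credentials."""
    # Imported lazily so build_upload_kwargs works without the package installed.
    from huggingface_hub import HfApi

    return HfApi().upload_file(**kwargs)
```

Keeping the argument mapping separate from the network call makes it possible to review exactly what would be uploaded before anything is pushed.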