# Dataset Plan
## Status
The project now has a deterministic SFT preview generator for local planning and schema validation.
Generate the current preview artifact with:
```bash
.venv/bin/python -B scripts/generate_dataset.py
```
Default output:
```text
data/train/objectverse_sft_preview.jsonl
```
This preview is mock-generated. It is not a final training dataset and should not be described as real model output.
The preview JSONL file is evidence for schema and workflow readiness only.
Curated v1 training-test artifact:
```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
--count 50 \
--output data/train/objectverse_sft_curated.jsonl
```
This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.
Published synthetic curated dataset:
```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```
Current curated v2 artifact:
```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
--version v2 \
--count 200 \
--output data/train/objectverse_sft_curated_v2.jsonl
```
The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes. Each mode has exactly 40 rows and no object-mode pair is repeated, so every object appears once in every mode. The v1 file remains available through the repository history.
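The coverage claim above can be checked mechanically. The following is a minimal sketch, not part of the repository scripts. It assumes that the `object_description` field identifies the object and that `mode` holds the personality mode, as in the schema example in this document; the function name `check_coverage` is hypothetical.

```python
from collections import Counter


def check_coverage(rows, expected_rows=200, expected_per_mode=40):
    """Return a list of coverage problems; an empty list means the checks passed.

    Assumes each row is a dict with `object_description` and `mode` keys.
    """
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # Every (object, mode) pair should appear at most once.
    pair_counts = Counter((r["object_description"], r["mode"]) for r in rows)
    repeated = sorted(pair for pair, n in pair_counts.items() if n > 1)
    if repeated:
        problems.append(f"repeated object-mode pairs: {repeated[:3]}")

    # Every mode should have the same number of rows.
    mode_counts = Counter(r["mode"] for r in rows)
    for mode, n in sorted(mode_counts.items()):
        if n != expected_per_mode:
            problems.append(f"mode {mode!r} has {n} rows, expected {expected_per_mode}")
    return problems
```

To run it against the local file, load the rows with `[json.loads(line) for line in open(path, encoding="utf-8")]` and print the returned list.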
## Target Dataset
Target before stronger fine-tuning:
- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text
## JSONL Schema
Each line is one training candidate:
```json
{
"id": "sft-preview-0001",
"source": "objectverse-diary-mock-mvp",
"split": "preview",
"mode": "Cynical",
"object_description": "old white coffee mug on a developer desk",
"object_understanding": {},
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
]
}
```
The `content` field of the assistant message is a JSON-encoded string, so each record works with chat-style SFT tools while preserving the structured `persona` and `diary` output.
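Because the assistant output is JSON nested inside JSON, each line needs two parsing passes. The sketch below illustrates one way to validate a line against the schema shown above. It assumes that every record carries exactly one system, one user, and one assistant message in that order, which the example suggests but this document does not guarantee; `validate_record` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

# Top-level keys taken from the schema example in this document.
REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}


def validate_record(line):
    """Parse one JSONL line and return the decoded assistant payload.

    Raises ValueError (json.JSONDecodeError is a subclass) on any problem.
    """
    record = json.loads(line)  # first pass: the JSONL record itself
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    roles = [message["role"] for message in record["messages"]]
    if roles != ["system", "user", "assistant"]:  # assumed message order
        raise ValueError(f"unexpected roles: {roles}")

    # Second pass: the assistant content is itself JSON text.
    payload = json.loads(record["messages"][-1]["content"])
    for key in ("persona", "diary"):
        if key not in payload:
            raise ValueError(f"assistant payload lacks {key!r}")
    return payload
```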
## Generation Workflow
Preview:
```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```
Full candidate pool later:
```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```
Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
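A held-out evaluation split only helps if it is reproducible. The sketch below shows one deterministic way to hold out a fraction of the curated rows. It is an illustration, not the scaffold's actual split logic, which is not shown in this document; the function name `split_rows` and the seeded shuffle are assumptions.

```python
import random


def split_rows(rows, eval_ratio=0.1, seed=0):
    """Return (train_rows, eval_rows) using a seeded shuffle.

    Datasets with fewer than two rows are returned whole, with no eval rows.
    """
    if len(rows) < 2:
        return list(rows), []
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    eval_count = max(1, round(len(shuffled) * eval_ratio))
    return shuffled[eval_count:], shuffled[:eval_count]
```

With `eval_ratio=0.1`, 200 rows split into 180 train and 20 eval rows, and 50 rows split into 45 and 5.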
Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
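A simple guard can keep fallback traces out of a training file. The trace file layout is not documented here, so this sketch only performs a plain substring check on raw lines; `split_by_fallback` is a hypothetical helper, not an existing script.

```python
FALLBACK_MARKER = "vision-fallback-to-mock"


def split_by_fallback(raw_lines):
    """Separate lines that mention the fallback marker from those that do not."""
    clean, flagged = [], []
    for line in raw_lines:
        # Any mention of the marker disqualifies the line from training data.
        (flagged if FALLBACK_MARKER in line else clean).append(line)
    return clean, flagged
```

Running this over a candidate file before curation and failing when `flagged` is non-empty is one way to enforce the rule above.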
## Modal LoRA Training Scaffold
The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.
Install the local Modal CLI dependency separately:
```bash
pip install -r requirements-training.txt
```
Validate the local JSONL shape without Modal auth or GPU usage:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated.jsonl \
--run-name objectverse-diary-qwen15b-curated-test
```
The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
Executed v2 training command:
```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.
Current v2 run summary:
- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (ignored path, not committed)
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`
Additional v2 scaffold validation run `objectverse-diary-qwen15b-lora-v2-curated50-retry1`, completed on Modal with the existing 50-row curated dataset:
- records: 50 total, 45 train, 5 eval
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha: 32 / 64
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.2551
- eval loss: 0.0093
- train runtime: 146.5398s
- epoch: 20.0
- adapter: downloaded to the ignored local `exports/` directory; not published to Hugging Face Hub
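The effective batch size and the reported `epoch` values can be cross-checked from the run settings. The sketch below is one accounting that reproduces both figures. It assumes a single GPU, a per-device batch size of 2 with 4 accumulation steps for both runs, and a trainer that rounds the last partial accumulation group of each epoch up to a full optimizer step. The scaffold's trainer internals are not shown in this document, so treat the functions as illustrative.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Rows consumed per optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def epochs_completed(train_rows, per_device_batch_size, gradient_accumulation_steps, max_steps):
    """Estimate the epoch counter after `max_steps` optimizer steps."""
    batches_per_epoch = math.ceil(train_rows / per_device_batch_size)
    # Assumption: a trailing partial accumulation group still counts as one step.
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    full_epochs, leftover_steps = divmod(max_steps, steps_per_epoch)
    leftover_batches = min(leftover_steps * gradient_accumulation_steps, batches_per_epoch)
    return full_epochs + leftover_batches / batches_per_epoch


print(effective_batch_size(2, 4))                   # 8
print(round(epochs_completed(180, 2, 4, 120), 4))   # 5.2222
print(round(epochs_completed(45, 2, 4, 120), 4))    # 20.0
```

For the 180-row run this gives 90 batches and 23 optimizer steps per epoch, so 120 steps cover 5 full epochs plus 20 of the next 90 batches. For the 45-row run, 23 batches give 6 steps per epoch, so 120 steps cover exactly 20 epochs.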
Default training scaffold settings:
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files
The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.
The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.
## Curation Checklist
- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output reads like an entry in a strange object archive.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.
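Two checklist items, clean JSON and the absence of obvious identifiers, can be partly automated before manual review. The sketch below is a rough first pass under stated assumptions: the regular expressions are illustrative and only catch email-like strings and strings with the `hf_` prefix used by Hugging Face access tokens. It cannot detect real names, addresses, or serial numbers, so it does not replace human curation. `scan_row` and `SENSITIVE_PATTERNS` are hypothetical names.

```python
import json
import re

# Illustrative patterns only; extend them for the identifiers that matter to you.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
}


def scan_row(line):
    """Return labels for every problem found in one raw JSONL line."""
    findings = []
    try:
        json.loads(line)
    except json.JSONDecodeError:
        findings.append("invalid_json")
    # Scan the raw text so nested JSON strings are covered as well.
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(line):
            findings.append(label)
    return findings
```

An empty list means the line passed these automated checks only; the remaining checklist items still need a human reader.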
## Publishing Notes
When publishing to Hugging Face Datasets:
- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo
Curated v2 was published with:
```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
--dataset-file data/train/objectverse_sft_curated_v2.jsonl \
--repo-id qqyule/objectverse-diary-sft-curated \
--path-in-repo objectverse_sft_curated_v2.jsonl \
--commit-message "Upload Objectverse Diary curated v2 dataset"
```