# Dataset Plan

## Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Generate the current preview artifact with:

```bash
.venv/bin/python -B scripts/generate_dataset.py
```

Default output:

```text
data/train/objectverse_sft_preview.jsonl
```

This preview is mock-generated. It is not a final training dataset and should not be described as real model output.

The preview JSONL file is evidence for schema and workflow readiness only.

Generate the curated v1 training-test artifact with:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl
```

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```

Generate the current curated v2 artifact with:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl
```

The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file remains preserved through repository history.
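The balance claims above (40 rows per mode, no repeated object-mode pair) can be checked mechanically. The sketch below is not part of the repository. It assumes each row carries the `mode` and `object_description` fields from the JSONL schema, and that `object_description` is what identifies an object. The object and mode names are placeholders.

```python
from collections import Counter
from itertools import product


def check_balance(rows, per_mode=40):
    """Return a list of problems; an empty list means the rows are balanced."""
    problems = []

    # Every mode must appear exactly `per_mode` times.
    mode_counts = Counter(row["mode"] for row in rows)
    for mode, count in sorted(mode_counts.items()):
        if count != per_mode:
            problems.append(f"mode {mode!r} has {count} rows, expected {per_mode}")

    # No (object, mode) pair may appear more than once.
    pair_counts = Counter((row["object_description"], row["mode"]) for row in rows)
    for (obj, mode), count in sorted(pair_counts.items()):
        if count > 1:
            problems.append(f"pair ({obj!r}, {mode!r}) appears {count} times")

    return problems


# Toy stand-in for the real file: 40 placeholder objects x 5 placeholder modes.
objects = [f"object-{i:02d}" for i in range(40)]
modes = [f"mode-{j}" for j in range(5)]
rows = [{"object_description": o, "mode": m} for o, m in product(objects, modes)]

print(len(rows), check_balance(rows))  # 200 []
```

Since 40 objects times 5 modes is exactly 200, a file that passes both checks contains every object-mode pair exactly once.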

## Target Dataset

Target before stronger fine-tuning:

- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text
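The privacy targets above can be pre-screened with a simple pattern scan before manual review. The sketch below is illustrative only: the pattern set and the `find_sensitive` helper are assumptions, not the project's rules, and a clean scan does not replace a human check.

```python
import re

# Illustrative patterns only (assumptions, not the project's actual rules).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
    "long_digit_run": re.compile(r"\b\d{10,}\b"),
}


def find_sensitive(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


print(find_sensitive("old white coffee mug on a developer desk"))  # []
print(find_sensitive("reach me at someone@example.com"))           # ['email']
```

Running the scan over the raw JSONL text of each row, rather than over individual fields, also catches identifiers hidden inside the serialized assistant output.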

## JSONL Schema

Each line is one training candidate:

```json
{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}
```

The `content` field of the assistant message holds JSON serialized as a string, so each record works with chat-style SFT tools while preserving the structured `persona` and `diary` output.
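A minimal validation sketch for one JSONL line follows. It is not the repository's validator. It assumes the top-level keys and the system, user, assistant message order shown in the example above are required, and the `validate_line` helper is hypothetical.

```python
import json

REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}
EXPECTED_ROLES = ["system", "user", "assistant"]  # assumed from the example


def validate_line(line):
    """Return a list of problems for one JSONL line; empty means it passed."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(record))]

    messages = record.get("messages")
    if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
        return problems + ["messages is not a list of objects"]
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        return problems + [f"unexpected roles: {roles}"]

    # The assistant content is JSON text, so it needs a second parse.
    try:
        payload = json.loads(messages[-1]["content"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return problems + ["assistant content is not valid JSON text"]
    if not isinstance(payload, dict):
        return problems + ["assistant JSON is not an object"]

    problems += [
        f"assistant JSON is missing key: {key}"
        for key in ("persona", "diary")
        if key not in payload
    ]
    return problems


sample_line = json.dumps({
    "id": "sft-preview-0001",
    "source": "objectverse-diary-mock-mvp",
    "split": "preview",
    "mode": "Cynical",
    "object_description": "old white coffee mug on a developer desk",
    "object_understanding": {},
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": json.dumps({"persona": {}, "diary": {}})},
    ],
})

print(validate_line(sample_line))  # []
```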

## Generation Workflow

Preview:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```

Full candidate pool later:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
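One way to hold out the evaluation share reproducibly is a seeded shuffle followed by a fixed-size cut. This is a sketch under assumptions: the `split_rows` helper and its seed are hypothetical, and the training scaffold may split differently.

```python
import random


def split_rows(rows, eval_ratio=0.1, seed=0):
    """Return (train, eval) lists; the same seed always gives the same split."""
    if not 0 < eval_ratio < 1:
        raise ValueError("eval_ratio must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one row, but only when there are two or more rows.
    n_eval = max(1, round(len(shuffled) * eval_ratio)) if len(shuffled) >= 2 else 0
    return shuffled[n_eval:], shuffled[:n_eval]


row_ids = [f"row-{i:04d}" for i in range(200)]
train, held_out = split_rows(row_ids, eval_ratio=0.1)
print(len(train), len(held_out))  # 180 20
```

Fixing the seed keeps the held-out rows stable across runs, so evaluation numbers from different training runs stay comparable.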

Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
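A guard against mixing fallback traces into training data could look like the sketch below. The trace file layout is not documented here, so the check is a plain substring match on the raw line. The `keep_clean` helper and the sample field names are hypothetical.

```python
FALLBACK_MARKER = "vision-fallback-to-mock"


def keep_clean(lines):
    """Drop blank lines and any line that mentions the fallback marker."""
    return [
        line for line in lines
        if line.strip() and FALLBACK_MARKER not in line
    ]


# Hypothetical raw lines; the real trace fields may differ.
raw_lines = [
    '{"id": "a", "flags": ["vision-fallback-to-mock"]}',
    '{"id": "b", "flags": []}',
    "",
]
print(keep_clean(raw_lines))  # ['{"id": "b", "flags": []}']
```

Matching on raw text is deliberately conservative: it rejects a line wherever the marker appears, without depending on the trace schema.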

## Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

```bash
pip install -r requirements-training.txt
```

Validate the local JSONL shape without Modal auth or GPU usage:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test
```

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Executed v2 training command:

```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.

Current v2 run summary:

- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (ignored, not committed)
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`
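The effective batch size and record counts in the summary follow from the command-line flags. The arithmetic below is a worked check, assuming a single GPU and that the eval share is rounded to a whole number of rows.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption: the run used one GPU

# Samples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

total_rows = 200
eval_ratio = 0.1
eval_rows = round(total_rows * eval_ratio)
train_rows = total_rows - eval_rows

print(effective_batch_size, train_rows, eval_rows)  # 8 180 20
```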

Additional v2 scaffold validation run: `objectverse-diary-qwen15b-lora-v2-curated50-retry1` completed on Modal with the existing 50-row curated dataset.

- records: 50 total, 45 train, 5 eval
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha: 32 / 64
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.2551
- eval loss: 0.0093
- train runtime: 146.5398s
- epoch: 20.0
- local adapter export: downloaded to ignored local `exports/`; not published to Hugging Face Hub

Default training scaffold settings:

- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files
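Assistant-output-only loss is commonly implemented by setting the label of every prompt token to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists with made-up token ids. It is not the scaffold's code, and `mask_prompt_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(prompt_ids, assistant_ids):
    """Build (input_ids, labels) so only assistant tokens contribute to loss."""
    input_ids = list(prompt_ids) + list(assistant_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(assistant_ids)
    return input_ids, labels


# Made-up token ids: three prompt tokens followed by two assistant tokens.
input_ids, labels = mask_prompt_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

The model still sees the full prompt as context. Masking only removes the prompt positions from the loss, so training capacity goes to the persona and diary JSON rather than to reproducing the instructions.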

The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

## Curation Checklist

- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output feels like an entry in an archive of strange objects.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.

## Publishing Notes

When publishing to Hugging Face Datasets:

- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo

Curated v2 was published with:

```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"
```