# Dataset Plan
## Status
The project now has a deterministic SFT preview generator for local planning and schema validation.
Generate the current preview artifact with:
```bash
.venv/bin/python -B scripts/generate_dataset.py
```
Default output:
```text
data/train/objectverse_sft_preview.jsonl
```
This preview is mock-generated. It is not a final training dataset and should not be described as real model output.
The preview JSONL file is evidence for schema and workflow readiness only.
Generate the curated v1 training-test artifact with:
```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
--count 50 \
--output data/train/objectverse_sft_curated.jsonl
```
This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.
Published synthetic curated dataset:
```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```
Generate the current curated v2 artifact with:
```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
--version v2 \
--count 200 \
--output data/train/objectverse_sft_curated_v2.jsonl
```
The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file remains preserved through repository history.
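The balance claims above can be checked mechanically. The sketch below assumes each row carries the `mode` and `object_description` fields from the JSONL schema, and that `object_description` is a stable key for the object. The function name `check_balance` is hypothetical and is not one of the repository's scripts.

```python
import json
from collections import Counter


def check_balance(lines, rows_per_mode=40):
    """Return a list of balance problems found in an iterable of JSONL lines."""
    rows = [json.loads(line) for line in lines if line.strip()]
    problems = []

    # Each personality mode should appear the same number of times.
    mode_counts = Counter(row["mode"] for row in rows)
    for mode, count in sorted(mode_counts.items()):
        if count != rows_per_mode:
            problems.append(f"mode {mode!r} has {count} rows, expected {rows_per_mode}")

    # No (object, mode) pair should appear more than once.
    # Assumption: object_description identifies the object.
    pair_counts = Counter((row["object_description"], row["mode"]) for row in rows)
    for pair, count in sorted(pair_counts.items()):
        if count > 1:
            problems.append(f"pair {pair!r} repeats {count} times")

    return problems
```

Passing the lines of `objectverse_sft_curated_v2.jsonl` should yield an empty list if the file matches the description; any entry in the result points at a mode count or a repeated pair worth reviewing.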
## Target Dataset
Target before stronger fine-tuning:
- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text
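A rough automated scan can support the privacy requirements above before manual review. This is a minimal sketch: the patterns are illustrative assumptions (for example, that Hugging Face tokens start with `hf_` and that long uppercase alphanumeric runs may be serial numbers), and `find_sensitive` is a hypothetical helper, not part of the repository.

```python
import re

# Illustrative patterns only; they complement a human review, not replace it.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Assumption: tokens of interest look like "hf_" plus 20 or more alphanumerics.
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{20,}\b"),
    # Assumption: 12+ uppercase letters/digits containing at least one digit
    # are treated as serial-number-like.
    "serial_like": re.compile(r"\b(?=[A-Z0-9]*\d)[A-Z0-9]{12,}\b"),
}


def find_sensitive(text):
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(
        name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)
    )
```

Running `find_sensitive` over every string field in a row, and rejecting rows with a non-empty result, gives a cheap first filter. False positives and misses are both possible with patterns this simple.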
## JSONL Schema
Each line is one training candidate:
```json
{
"id": "sft-preview-0001",
"source": "objectverse-diary-mock-mvp",
"split": "preview",
"mode": "Cynical",
"object_description": "old white coffee mug on a developer desk",
"object_understanding": {},
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
]
}
```
The `assistant.content` field is JSON text so it can be used with chat-style SFT tools while preserving structured outputs.
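Because `assistant.content` is JSON text inside a JSON record, validation needs two decode steps. The sketch below assumes every record has exactly the three messages shown in the example, in that order, and that the assistant payload has `persona` and `diary` keys. `validate_record` is a hypothetical name.

```python
import json

REQUIRED_KEYS = (
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
)


def validate_record(line):
    """Parse one JSONL line and return the decoded assistant payload.

    Raises ValueError when the record does not match the schema sketch.
    """
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    # Assumption: messages are always system, user, assistant in that order.
    roles = [message["role"] for message in record["messages"]]
    if roles != ["system", "user", "assistant"]:
        raise ValueError(f"unexpected message roles: {roles}")

    # assistant.content is a JSON string, so it needs a second decode.
    payload = json.loads(record["messages"][-1]["content"])
    for key in ("persona", "diary"):
        if key not in payload:
            raise ValueError(f"assistant payload lacks {key!r}")
    return payload
```

A malformed inner string raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so callers can catch a single exception type for both structural and parsing failures.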
## Generation Workflow
Preview:
```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```
Later, generate the full candidate pool:
```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```
Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
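One way to hold out the evaluation share is a deterministic split keyed on the row `id`. This is a sketch under assumptions: it is not necessarily how `scripts/finetune_lora.py` splits data, and `split_rows` is a hypothetical helper.

```python
import hashlib


def split_rows(row_ids, eval_ratio=0.1):
    """Split ids into (train, eval) deterministically by hashing each id."""
    # Sorting by a hash of the id gives a stable pseudo-random order
    # that does not depend on file order or a random seed.
    ordered = sorted(
        row_ids, key=lambda rid: hashlib.sha256(rid.encode("utf-8")).hexdigest()
    )
    # Keep at least one eval row, but only when there are two or more rows.
    eval_count = max(1, round(len(ordered) * eval_ratio)) if len(ordered) >= 2 else 0
    return ordered[eval_count:], ordered[:eval_count]
```

With 200 ids and `eval_ratio=0.1` this yields 180 training ids and 20 evaluation ids. Raising the ratio to `0.15` covers the upper end of the 10-15% range.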
Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
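A guard like the following can keep fallback traces out of a curated file. It is a minimal sketch: the trace layout is not documented here, so the code searches the whole record for the marker instead of assuming a specific key, and both function names are hypothetical.

```python
import json

FALLBACK_MARKER = "vision-fallback-to-mock"


def is_fallback_trace(line):
    """Return True when the marker appears anywhere in the decoded record."""
    record = json.loads(line)
    # Re-serializing searches every nested field without assuming
    # which key carries the marker.
    return FALLBACK_MARKER in json.dumps(record)


def keep_clean(lines):
    """Drop blank lines and any record flagged as a fallback trace."""
    return [line for line in lines if line.strip() and not is_fallback_trace(line)]
```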
## Modal LoRA Training Scaffold
The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.
Install the local Modal CLI dependency separately:
```bash
pip install -r requirements-training.txt
```
Validate the local JSONL shape without Modal auth or GPU usage:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated.jsonl \
--run-name objectverse-diary-qwen15b-curated-test
```
The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
Executed v2 training command:
```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.
Current v2 run summary:
- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (ignored by version control)
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`
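Several figures in the summary follow from the run settings by simple arithmetic. The sketch below assumes a single GPU and uses a naive epoch estimate that ignores how the trainer handles partial batches, so it will not reproduce the reported `epoch` value exactly. `run_arithmetic` is a hypothetical helper.

```python
def run_arithmetic(records, eval_ratio, per_device_batch, grad_accum, max_steps, devices=1):
    """Derive split sizes and batch figures from the run settings."""
    eval_rows = round(records * eval_ratio)
    train_rows = records - eval_rows
    # Effective batch size: samples consumed per optimizer step.
    effective_batch = per_device_batch * grad_accum * devices
    # Naive estimate: total samples seen divided by training rows.
    approx_epochs = max_steps * effective_batch / train_rows
    return {
        "train_rows": train_rows,
        "eval_rows": eval_rows,
        "effective_batch": effective_batch,
        "approx_epochs": approx_epochs,
    }


if __name__ == "__main__":
    # Settings taken from the v2 run summary.
    print(run_arithmetic(200, 0.1, 2, 4, 120))
```

For the v2 settings this gives 180 train rows, 20 eval rows, and an effective batch size of 8, matching the summary. The naive epoch estimate is roughly 5.33, slightly above the reported 5.2222; the trainer's own accounting of partial batches may explain the gap.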
Additional v2 scaffold validation run: `objectverse-diary-qwen15b-lora-v2-curated50-retry1` completed on Modal with the existing 50-row curated dataset.
- records: 50 total, 45 train, 5 eval
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha: 32 / 64
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.2551
- eval loss: 0.0093
- train runtime: 146.5398s
- epoch: 20.0
- local adapter export: downloaded to ignored local `exports/`; not published to Hugging Face Hub
Default training scaffold settings:
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files
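The assistant-output-only loss in the settings above is commonly implemented by setting prompt-token labels to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows that idea on plain lists of token ids. It is an illustration, not the scaffold's actual code, and `mask_prompt_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # Label value that PyTorch cross-entropy skips by default.


def mask_prompt_labels(prompt_ids, assistant_ids):
    """Build (input_ids, labels) so only assistant tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(assistant_ids)
    # Prompt positions are masked; assistant positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(assistant_ids)
    return input_ids, labels


if __name__ == "__main__":
    # Made-up token ids: three prompt tokens, two assistant tokens.
    print(mask_prompt_labels([11, 12, 13], [21, 22]))
```

The model still sees the system and user tokens as context, but the gradient comes only from the persona and diary JSON in the assistant turn.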
The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.
The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.
## Curation Checklist
- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output reads like an entry in an archive of strange objects.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.
## Publishing Notes
When publishing to Hugging Face Datasets:
- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo
Curated v2 was published with:
```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
--dataset-file data/train/objectverse_sft_curated_v2.jsonl \
--repo-id qqyule/objectverse-diary-sft-curated \
--path-in-repo objectverse_sft_curated_v2.jsonl \
--commit-message "Upload Objectverse Diary curated v2 dataset"
```