# Scripts

Automation scripts for dataset generation, fine-tuning, GGUF conversion, llama.cpp runtime, and trace export.

Implemented initial scripts:

- `check_initial_stage.py`: verifies required files, runtime defaults, sample traces, pipeline, and Gradio build.
- `generate_sample_traces.py`: creates six stable public mock traces under `data/traces/samples/`.
- `generate_dataset.py`: creates deterministic SFT preview JSONL for schema and curation planning.
- `prepare_curated_dataset.py`: creates deterministic synthetic curated SFT rows; v1 defaults to 50 rows, v2 defaults to 200 rows across 40 objects and 5 modes.
- `export_traces.py`: exports validated public sample traces to JSONL for dataset-style publishing.
- `check_space_vlm.py`: validates MiniCPM-V object understanding on the hosted Hugging Face Space with three temporary public test images.
- `check_llama_cpp_smoke.py`: smoke-tests the optional llama.cpp text runtime with an external GGUF model.
- `finetune_lora.py`: validates SFT JSONL locally and defines the Modal LoRA training scaffold with optional eval split, assistant-output-only loss, and tunable LoRA/batch settings.
- `publish_hf_dataset.py`: validates and uploads curated JSONL files to a Hugging Face Dataset repository.
- `publish_hf_adapter.py`: uploads a downloaded LoRA adapter folder to Hugging Face Hub.
- `merge_lora_adapter.py`: merges a local PEFT LoRA adapter into a Hugging Face base model and saves tokenizer files.
- `publish_hf_gguf.py`: validates and uploads a local GGUF file to a Hugging Face model repository.

Scripts still expected to be added during implementation:

- `convert_to_gguf.sh`
- `run_llama_cpp.sh`

Modal LoRA dry-run:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test
```
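
The dry-run validates the SFT JSONL before anything is sent to Modal. The real schema is defined in `finetune_lora.py` and is not reproduced here, so the sketch below only illustrates the kind of check involved. It assumes chat-style rows with a `messages` list of `{"role", "content"}` objects ending in an assistant turn; the function name, field names, and role set are all hypothetical.

```python
import json

# Assumed role names; the real script may accept a different set.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sft_rows(lines):
    """Return (line_number, problem) tuples for an iterable of JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        well_formed = all(
            isinstance(message, dict)
            and message.get("role") in ALLOWED_ROLES
            and isinstance(message.get("content"), str)
            for message in messages
        )
        if not well_formed:
            problems.append((number, "message with unknown role or non-string content"))
        elif messages[-1]["role"] != "assistant":
            problems.append((number, "last message is not an assistant turn"))
    return problems
```

Passing an open file handle works because the function only iterates over lines, for example `validate_sft_rows(open(path, encoding="utf-8"))`.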

Generate the larger curated v2 dataset used by the Modal LoRA v2 dry-run:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl
```
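
The v2 defaults (200 rows across 40 objects and 5 modes) are consistent with one row per object and mode pair, since 40 × 5 = 200. The sketch below shows one way to produce such rows deterministically. It is an assumption about the pairing, not a description of `prepare_curated_dataset.py`; `build_rows`, the placeholder object and mode names, and the row fields are hypothetical.

```python
import hashlib
import itertools
import json


def build_rows(objects, modes, count):
    """Deterministically pair objects with modes and return `count` rows."""
    # Sorting makes the output independent of the input order.
    pairs = list(itertools.product(sorted(objects), sorted(modes)))
    rows = []
    for index in range(count):
        obj, mode = pairs[index % len(pairs)]
        # A content hash gives each row a stable ID without any randomness.
        digest = hashlib.sha256(f"{index}:{obj}:{mode}".encode("utf-8")).hexdigest()
        rows.append({"id": digest[:12], "object": obj, "mode": mode})
    return rows


objects = [f"object-{i:02d}" for i in range(40)]  # placeholder names
modes = [f"mode-{i}" for i in range(5)]  # placeholder names
rows = build_rows(objects, modes, 200)

# sort_keys keeps the serialized JSONL byte-stable across runs.
jsonl = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
```

Avoiding random sampling entirely, rather than seeding it, keeps the file reproducible across Python versions.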

Publish curated v2 dataset:

```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl
```

Modal LoRA v2 dry-run:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Modal LoRA v2 training:

```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

For epoch-based experiments, set `--max-steps 0` and provide `--num-train-epochs`.
Assistant-output-only loss is enabled by default; pass `--no-assistant-only-loss` only for debugging full-text loss behavior.
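
To see how `--max-steps` relates to epochs for the v2 settings above, a rough budget calculation helps. The sketch assumes a single GPU, that the eval split is carved out of the same 200-row file, and simple rounding of the split size; the trainer may differ on any of these. `training_budget` is a hypothetical helper.

```python
import math


def training_budget(rows, eval_ratio, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Estimate split sizes, effective batch size, and epochs covered by max_steps."""
    eval_rows = int(round(rows * eval_ratio))  # assumed rounding rule
    train_rows = rows - eval_rows
    # Samples consumed per optimizer step.
    effective_batch = per_device_batch * grad_accum * num_devices
    return {
        "train_rows": train_rows,
        "eval_rows": eval_rows,
        "effective_batch": effective_batch,
        "steps_per_epoch": math.ceil(train_rows / effective_batch),
        "approx_epochs": max_steps * effective_batch / train_rows,
    }


budget = training_budget(
    rows=200, eval_ratio=0.1, per_device_batch=2, grad_accum=4, max_steps=120
)
```

Under those assumptions, 120 steps at an effective batch of 8 consume 960 samples, or roughly 5.3 passes over 180 training rows.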

Training dependencies are intentionally separate from the Space runtime:

```bash
pip install -r requirements-training.txt
```

Do not commit Modal credit codes, tokens, Hugging Face tokens, generated adapters, GGUF files, or private datasets.
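
A small guard can catch generated artifacts before they are committed. The sketch below flags paths by suffix and top-level directory. The blocked suffixes and directories are assumptions drawn from the artifacts named in this README, and `blocked_paths` is a hypothetical helper, not part of the repository. It does not detect tokens or credit codes inside file contents.

```python
from pathlib import PurePosixPath

# Assumed artifact markers; adjust to match the repository's .gitignore.
BLOCKED_SUFFIXES = {".gguf", ".safetensors"}
BLOCKED_TOP_DIRS = {"exports", "models"}


def blocked_paths(paths):
    """Return the subset of repo-relative paths that look like generated artifacts."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        in_blocked_dir = bool(path.parts) and path.parts[0] in BLOCKED_TOP_DIRS
        if path.suffix.lower() in BLOCKED_SUFFIXES or in_blocked_dir:
            flagged.append(raw)
    return flagged
```

One way to use it is to feed in the output of `git diff --cached --name-only` and abort the commit if the returned list is non-empty.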

If `modal run` reports `Token missing`, authenticate outside the repository first:

```bash
modal token new
```

Alternatively, configure `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` through your shell or secret manager.
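
When relying on environment variables, a quick pre-flight check can report which ones are unset before `modal run` is invoked. This is a minimal sketch and `missing_modal_vars` is a hypothetical helper. An empty result does not prove the token is valid, and a non-empty result does not rule out authentication through the configuration stored by `modal token new`.

```python
import os

REQUIRED_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_vars(env=None):
    """Return the names of required Modal variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Accepting an explicit mapping keeps the check easy to test without touching the real environment.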

After a successful Modal run, download the adapter from the output volume into the ignored local `exports/` directory. Modal's directory download behavior can vary, so downloading the adapter files one by one into a directory is the safest path.

```bash
mkdir -p exports/objectverse-diary-qwen15b-lora-v2-adapter-dir
for file in vocab.json tokenizer_config.json tokenizer.json special_tokens_map.json merges.txt chat_template.jinja added_tokens.json adapter_model.safetensors adapter_config.json README.md; do
  modal volume get objectverse-diary-lora-output \
    "objectverse-diary-qwen15b-lora-v2/adapter/$file" \
    "exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/$file"
done
modal volume get objectverse-diary-lora-output \
  objectverse-diary-qwen15b-lora-v2/metrics.json \
  exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/training_metrics.json
modal volume get objectverse-diary-lora-output \
  objectverse-diary-qwen15b-lora-v2/training_config.json \
  exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/training_config.json
```
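
Because the files are fetched one at a time, it is worth confirming that nothing essential was skipped before publishing. The sketch below treats `adapter_model.safetensors` and `adapter_config.json` as the required pair for a PEFT adapter and leaves the tokenizer files optional; that split is an assumption, and `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed minimum set for loading a PEFT LoRA adapter.
REQUIRED_FILES = ("adapter_model.safetensors", "adapter_config.json")


def missing_adapter_files(adapter_dir, required=REQUIRED_FILES):
    """Return required file names that are absent or empty in adapter_dir."""
    root = Path(adapter_dir)
    return [
        name
        for name in required
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

For the download above, the call would be `missing_adapter_files("exports/objectverse-diary-qwen15b-lora-v2-adapter-dir")`, with an empty list meaning both files are present and non-empty.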

Then upload the adapter to Hugging Face Hub:

```bash
.venv/bin/python -B scripts/publish_hf_adapter.py \
  --adapter-dir exports/objectverse-diary-qwen15b-lora-v2-adapter-dir \
  --repo-id qqyule/objectverse-diary-qwen15b-lora \
  --commit-message "Upload Objectverse Diary Qwen 1.5B LoRA v2"
```

LoRA v2 GGUF conversion and upload:

```bash
.venv/bin/python -B scripts/merge_lora_adapter.py \
  --base-model Qwen/Qwen2.5-1.5B-Instruct \
  --adapter exports/objectverse-diary-qwen15b-lora-v2-adapter-dir \
  --output exports/objectverse-diary-qwen15b-lora-v2-merged-hf

git clone https://github.com/ggml-org/llama.cpp.git .tmp/llama.cpp
git -C .tmp/llama.cpp checkout 8f83d6c271d194bde2d410145a0ce73bc42e85cd
cmake -S .tmp/llama.cpp -B .tmp/llama.cpp/build -DCMAKE_BUILD_TYPE=Release
cmake --build .tmp/llama.cpp/build --target llama-quantize -j

.venv/bin/python .tmp/llama.cpp/convert_hf_to_gguf.py \
  exports/objectverse-diary-qwen15b-lora-v2-merged-hf \
  --outfile models/objectverse-diary-qwen15b-lora-v2-f16.gguf \
  --outtype f16

.tmp/llama.cpp/build/bin/llama-quantize \
  models/objectverse-diary-qwen15b-lora-v2-f16.gguf \
  models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
  Q4_K_M

.venv/bin/python -B scripts/publish_hf_gguf.py \
  --gguf-file models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
  --repo-id qqyule/objectverse-diary-qwen15b-lora \
  --path-in-repo objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
  --commit-message "Upload Objectverse Diary Qwen 1.5B LoRA v2 Q4_K_M GGUF"
```

The final Q4_K_M GGUF stays under `models/`, which is ignored by version control. After upload, remove only generated intermediates such as the merged HF folder and the F16 GGUF.
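
A quick header check can catch a truncated or mislabeled file before upload or a smoke test. The sketch below reads the fixed-size start of a GGUF file: the 4-byte `GGUF` magic, a little-endian `uint32` version, and two `uint64` counts. It assumes GGUF version 2 or later in little-endian layout, and it is not a description of what `publish_hf_gguf.py` validates; `read_gguf_header` is a hypothetical helper.

```python
import struct


def read_gguf_header(path):
    """Read the magic, version, and counts from the start of a GGUF file."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a complete GGUF header")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}
```

A zero tensor count or an unexpected version in the result is a reasonable signal to rerun the conversion rather than upload the file.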

Space VLM validation:

```bash
.venv/bin/python -B scripts/check_space_vlm.py \
  --space-url https://huggingface.co/spaces/build-small-hackathon/ObjectverseDiary \
  --output docs/SPACE_VLM_REPORT.md \
  --json-output docs/SPACE_VLM_REPORT.json \
  --failure-notes-output docs/FAILURES.md
```

Changes to the external Space are never made implicitly; they must be requested with explicit flags:

```bash
.venv/bin/python -B scripts/check_space_vlm.py --configure-space --rollback-to-mock
```

Local LoRA v2 GGUF smoke test:

```bash
.venv/bin/python -B scripts/check_llama_cpp_smoke.py \
  --model-path models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf
```

Published GGUF source: `qqyule/objectverse-diary-qwen15b-lora`, file `objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf`. Do not commit the downloaded file.

Current status. Implemented:

- mock trace generation
- trace JSONL export
- SFT preview generation
- synthetic curated v2 dataset publishing
- optional MiniCPM-V wiring
- optional llama.cpp wiring
- hosted Space VLM validation tooling with non-secret probe support
- local GGUF smoke helper
- Modal LoRA training scaffolding
- Modal LoRA v2 training
- HF adapter publishing
- GGUF conversion
- GGUF upload
- local llama.cpp smoke test

Not completed yet: real text-model validation on Space.