# Scripts
Automation scripts for dataset generation, fine-tuning, GGUF conversion, llama.cpp runtime, and trace export.
Implemented initial scripts:
- `check_initial_stage.py`: verifies required files, runtime defaults, sample traces, pipeline, and Gradio build.
- `generate_sample_traces.py`: creates six stable public mock traces under `data/traces/samples/`.
- `generate_dataset.py`: creates deterministic SFT preview JSONL for schema and curation planning.
- `prepare_curated_dataset.py`: creates deterministic synthetic curated SFT rows; v1 defaults to 50 rows, v2 defaults to 200 rows across 40 objects and 5 modes.
- `export_traces.py`: exports validated public sample traces to JSONL for dataset-style publishing.
- `check_space_vlm.py`: validates MiniCPM-V object understanding on the hosted Hugging Face Space with three temporary public test images.
- `check_llama_cpp_smoke.py`: smoke-tests the optional llama.cpp text runtime with an external GGUF model.
- `finetune_lora.py`: validates SFT JSONL locally and defines the Modal LoRA training scaffold with optional eval split, assistant-output-only loss, and tunable LoRA/batch settings.
- `publish_hf_dataset.py`: validates and uploads curated JSONL files to a Hugging Face Dataset repository.
- `publish_hf_adapter.py`: uploads a downloaded LoRA adapter folder to Hugging Face Hub.
- `merge_lora_adapter.py`: merges a local PEFT LoRA adapter into a Hugging Face base model and saves tokenizer files.
- `publish_hf_gguf.py`: validates and uploads a local GGUF file to a Hugging Face model repository.
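The v2 default of 200 rows across 40 objects and 5 modes matches one row per (object, mode) pair, since 40 × 5 = 200. The sketch below shows one way to enumerate such a plan deterministically. It is an illustration only: the function `plan_rows`, the placeholder object and mode names, and the one-row-per-pair layout are all assumptions, not the behavior of `prepare_curated_dataset.py`.

```python
from itertools import product


def plan_rows(objects, modes):
    """Return one row stub per (object, mode) pair in a fixed, repeatable order."""
    return [
        {"row_id": index, "object": obj, "mode": mode}
        for index, (obj, mode) in enumerate(product(objects, modes))
    ]


# Placeholder names (hypothetical); the real object and mode lists live in the script.
objects = [f"object_{i:02d}" for i in range(40)]
modes = [f"mode_{i}" for i in range(5)]

rows = plan_rows(objects, modes)
print(len(rows))  # 200
```

Because the order comes only from the input lists, repeated calls give identical rows without a random seed.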
Planned files, to be added during implementation:
- `convert_to_gguf.sh`
- `run_llama_cpp.sh`
Modal LoRA dry-run:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated.jsonl \
--run-name objectverse-diary-qwen15b-curated-test
```
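The dry run validates the SFT JSONL locally before anything is sent to Modal. As a rough idea of what such a check can involve, here is a minimal sketch. The schema is an assumption: it supposes chat-style rows with a `messages` list of `role`/`content` entries that ends with an assistant turn. The actual fields checked by `finetune_lora.py` may differ, and `validate_sft_lines` is a hypothetical helper.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role names


def validate_sft_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means valid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append((number, "message with unknown role"))
                break
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((number, "message with empty content"))
                break
        else:
            # Only reached when every message passed the checks above.
            if messages[-1]["role"] != "assistant":
                problems.append((number, "last message is not from the assistant"))
    return problems
```

To check a file, pass an open file handle, for example `validate_sft_lines(open(path, encoding="utf-8"))`.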
Generate the larger curated v2 dataset:
```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
--version v2 \
--count 200 \
--output data/train/objectverse_sft_curated_v2.jsonl
```
Publish curated v2 dataset:
```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
--dataset-file data/train/objectverse_sft_curated_v2.jsonl \
--repo-id qqyule/objectverse-diary-sft-curated \
--path-in-repo objectverse_sft_curated_v2.jsonl
```
Modal LoRA v2 dry-run:
```bash
.venv/bin/python -B scripts/finetune_lora.py \
--dry-run \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
Modal LoRA v2 training:
```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
--dataset data/train/objectverse_sft_curated_v2.jsonl \
--run-name objectverse-diary-qwen15b-lora-v2 \
--max-steps 120 \
--learning-rate 1e-4 \
--max-seq-length 1536 \
--lora-r 32 \
--lora-alpha 64 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 4 \
--eval-ratio 0.1 \
--eval-steps 20
```
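To see what these flags imply together, the arithmetic can be sketched as follows. The assumptions are a single GPU, the 200-row v2 dataset, and an eval split that rounds `rows * eval_ratio` to the nearest integer; none of these are confirmed by the script itself. The LoRA scale `alpha / r` is the standard scaling factor applied to the adapter update. `training_budget` is a hypothetical helper.

```python
def training_budget(total_rows, eval_ratio, per_device_batch, grad_accum,
                    max_steps, lora_r, lora_alpha, num_devices=1):
    """Derive batch size, sample count, and LoRA scale from the CLI settings."""
    eval_rows = round(total_rows * eval_ratio)  # assumed split rounding
    train_rows = total_rows - eval_rows
    # Sequences contributing to each optimizer step.
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_seen = effective_batch * max_steps
    return {
        "train_rows": train_rows,
        "eval_rows": eval_rows,
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "approx_epochs": samples_seen / train_rows,
        "lora_scale": lora_alpha / lora_r,
    }


budget = training_budget(
    total_rows=200, eval_ratio=0.1, per_device_batch=2, grad_accum=4,
    max_steps=120, lora_r=32, lora_alpha=64,
)
for key, value in budget.items():
    print(f"{key}: {value}")
```

Under those assumptions, 120 steps at an effective batch of 8 cover 960 sequences, a little over five passes through 180 training rows.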
For epoch-based experiments, set `--max-steps 0` and provide `--num-train-epochs`.
Assistant-output-only loss is enabled by default; pass `--no-assistant-only-loss` only for debugging full-text loss behavior.
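Assistant-output-only loss is usually implemented by masking labels, so that prompt tokens do not contribute to the loss. The sketch below shows the idea on plain lists. It assumes a per-token mask marking assistant tokens is already available and uses `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. How `finetune_lora.py` builds its mask is not shown here, and `mask_non_assistant` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch cross-entropy


def mask_non_assistant(token_ids, assistant_mask):
    """Copy token ids into labels, replacing non-assistant positions with IGNORE_INDEX."""
    if len(token_ids) != len(assistant_mask):
        raise ValueError("token_ids and assistant_mask must have the same length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(token_ids, assistant_mask)
    ]


# Two prompt tokens followed by two assistant tokens (made-up ids).
labels = mask_non_assistant([11, 12, 13, 14], [0, 0, 1, 1])
print(labels)  # [-100, -100, 13, 14]
```

With full-text loss (`--no-assistant-only-loss`), the labels would simply equal the token ids.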
Training dependencies are intentionally separate from the Space runtime:
```bash
pip install -r requirements-training.txt
```
Do not commit Modal credit codes, Modal or Hugging Face tokens, generated adapters, GGUF files, or private datasets.
If `modal run` reports `Token missing`, authenticate outside the repository first:
```bash
modal token new
```
Alternatively, set `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` through your shell or secret manager.
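For the environment-variable route, a small pre-flight check can report which variables are missing before `modal run` is attempted. This is a sketch only: `missing_modal_env` is a hypothetical helper, and it does not detect a token stored by `modal token new`, which is saved outside the environment.

```python
import os

REQUIRED_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_modal_env()
if missing:
    # Print names only; never print token values.
    print("Not set:", ", ".join(missing))
```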
After a successful Modal run, download the adapter from the output volume into the git-ignored `exports/` directory. Modal's directory download behavior can vary, so downloading individual adapter files into a directory is the safest path.
```bash
mkdir -p exports/objectverse-diary-qwen15b-lora-v2-adapter-dir
for file in vocab.json tokenizer_config.json tokenizer.json special_tokens_map.json merges.txt chat_template.jinja added_tokens.json adapter_model.safetensors adapter_config.json README.md; do
modal volume get objectverse-diary-lora-output \
"objectverse-diary-qwen15b-lora-v2/adapter/$file" \
"exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/$file"
done
modal volume get objectverse-diary-lora-output \
objectverse-diary-qwen15b-lora-v2/metrics.json \
exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/training_metrics.json
modal volume get objectverse-diary-lora-output \
objectverse-diary-qwen15b-lora-v2/training_config.json \
exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/training_config.json
```
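Since the files are fetched one at a time, it can be worth confirming that the essential ones arrived before uploading. The sketch below checks for the two adapter files from the list above that a PEFT adapter needs in order to load. `missing_adapter_files` is a hypothetical helper, and treating only these two files as required is an assumption.

```python
from pathlib import Path

# Assumed minimum for a loadable adapter; the tokenizer files are left optional here.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir, required=REQUIRED_ADAPTER_FILES):
    """Return required file names that are absent or empty in adapter_dir."""
    root = Path(adapter_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


print(missing_adapter_files("exports/objectverse-diary-qwen15b-lora-v2-adapter-dir"))
```

An empty list means both files are present and non-empty.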
Then upload the adapter to Hugging Face Hub:
```bash
.venv/bin/python -B scripts/publish_hf_adapter.py \
--adapter-dir exports/objectverse-diary-qwen15b-lora-v2-adapter-dir \
--repo-id qqyule/objectverse-diary-qwen15b-lora \
--commit-message "Upload Objectverse Diary Qwen 1.5B LoRA v2"
```
LoRA v2 GGUF conversion and upload:
```bash
# 1. Merge the LoRA adapter into the base model.
.venv/bin/python -B scripts/merge_lora_adapter.py \
--base-model Qwen/Qwen2.5-1.5B-Instruct \
--adapter exports/objectverse-diary-qwen15b-lora-v2-adapter-dir \
--output exports/objectverse-diary-qwen15b-lora-v2-merged-hf

# 2. Fetch llama.cpp at a pinned commit and build the quantizer.
git clone https://github.com/ggml-org/llama.cpp.git .tmp/llama.cpp
git -C .tmp/llama.cpp checkout 8f83d6c271d194bde2d410145a0ce73bc42e85cd
cmake -S .tmp/llama.cpp -B .tmp/llama.cpp/build -DCMAKE_BUILD_TYPE=Release
cmake --build .tmp/llama.cpp/build --target llama-quantize -j

# 3. Convert the merged model to an F16 GGUF.
.venv/bin/python .tmp/llama.cpp/convert_hf_to_gguf.py \
exports/objectverse-diary-qwen15b-lora-v2-merged-hf \
--outfile models/objectverse-diary-qwen15b-lora-v2-f16.gguf \
--outtype f16

# 4. Quantize the F16 GGUF to Q4_K_M.
.tmp/llama.cpp/build/bin/llama-quantize \
models/objectverse-diary-qwen15b-lora-v2-f16.gguf \
models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
Q4_K_M

# 5. Upload the quantized GGUF to Hugging Face Hub.
.venv/bin/python -B scripts/publish_hf_gguf.py \
--gguf-file models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
--repo-id qqyule/objectverse-diary-qwen15b-lora \
--path-in-repo objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf \
--commit-message "Upload Objectverse Diary Qwen 1.5B LoRA v2 Q4_K_M GGUF"
```
The final Q4_K_M GGUF stays under `models/`, which is git-ignored. After upload, remove only generated intermediates such as the merged HF folder and the F16 GGUF.
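A cheap sanity check on a converted or downloaded file is to read its header: a GGUF file starts with the 4-byte magic `GGUF`, followed by a little-endian 32-bit version number. The sketch below reads just those 8 bytes. `read_gguf_version` is a hypothetical helper, not part of `publish_hf_gguf.py`, and passing this check does not prove the rest of the file is intact.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

For example, `read_gguf_version("models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf")` returns the version when the file exists and has a valid header.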
Space VLM validation:
```bash
.venv/bin/python -B scripts/check_space_vlm.py \
--space-url https://huggingface.co/spaces/build-small-hackathon/ObjectverseDiary \
--output docs/SPACE_VLM_REPORT.md \
--json-output docs/SPACE_VLM_REPORT.json \
--failure-notes-output docs/FAILURES.md
```
Changes to the external Space are made only when requested explicitly:
```bash
.venv/bin/python -B scripts/check_space_vlm.py --configure-space --rollback-to-mock
```
Local LoRA v2 GGUF smoke test:
```bash
.venv/bin/python -B scripts/check_llama_cpp_smoke.py \
--model-path models/objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf
```
Published GGUF source: `qqyule/objectverse-diary-qwen15b-lora`, file `objectverse-diary-qwen15b-lora-v2-q4_k_m.gguf`. Do not commit the downloaded file.
Current status: the following are implemented.
- Mock trace generation
- Trace JSONL export
- SFT preview generation
- Synthetic curated v2 dataset publishing
- Optional MiniCPM-V wiring
- Optional llama.cpp wiring
- Hosted Space VLM validation tooling with non-secret probe support
- Local GGUF smoke helper
- Modal LoRA training scaffolding
- Modal LoRA v2 training
- HF adapter publishing
- GGUF conversion
- GGUF upload
- Local llama.cpp smoke test

Real text-model validation on the Space is not yet complete.