DataForge-0.5B-SFT

DataForge-0.5B-SFT is a supervised-fine-tuned warmup checkpoint for tabular data-quality repair experiments. The current training path uses chunk-level DataForge expert trajectories whose exact repairs are derived from audited dirty/clean CSV diffs. The earlier v0-smoke release only proved the Kaggle-to-Hugging-Face pipeline and should not be read as a performance claim.

Intended Use

  • Research on tabular data-quality agents and repair planning.
  • Offline evaluation on DataForge-Bench-style Hospital, Flights, and Beers tasks.
  • Warm-starting later DataForge RL experiments.

This checkpoint is not intended for autonomous production data modification, medical decision support, regulated data governance, or unsupervised repair of private datasets.

Training Data

  • Dataset repo: Praneshrajan15/dataforge-sft-trajectories.
  • Dataset repo SHA used for this run: 1e8612e5ddd48ef2d7ab78592059d187bd67ba3e.
  • Training examples: 1226 chunk-level expert_v2 JSONL records.
  • Data sources: Raha benchmark Hospital, Flights, and Beers datasets via the BigDaMa/raha repository.
  • Primary label source: oracle_from_clean_diff dirty/clean CSV diffs.
  • Legacy teacher lineage: Groq-hosted clean-diff-v1 ReAct smoke records may remain for auditability, but exact repairs are not teacher-discovered labels.
  • Flights schedule and actual-time repairs are supervised from dirty/clean labels; they are not inferred from incomplete prompt context.
  • Split safety: held-out rows are reserved before chunking and excluded from SFT target rows, context rows, normalization candidates, fixes, and messages.
  • Hard negatives: clean train chunks are retained as finish examples with empty repairs so the model is penalized for unnecessary edits.
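
The split-safety, clean-diff labeling, and hard-negative points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names, the repair fields (row, column, old, new), the held-out fraction, the seed, and the sample CSV text are all hypothetical.

```python
import csv
import io
import random


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_row_ids(n_rows, heldout_fraction, seed):
    """Reserve held-out row indices before any chunking happens."""
    ids = list(range(n_rows))
    random.Random(seed).shuffle(ids)
    n_heldout = int(n_rows * heldout_fraction)
    return sorted(ids[n_heldout:]), sorted(ids[:n_heldout])


def diff_repairs(dirty_rows, clean_rows, row_ids):
    """Cell-level repairs for the given rows, read off the dirty/clean diff."""
    repairs = []
    for i in row_ids:
        for column, dirty_value in dirty_rows[i].items():
            clean_value = clean_rows[i][column]
            if dirty_value != clean_value:
                repairs.append({"row": i, "column": column,
                                "old": dirty_value, "new": clean_value})
    return repairs


def build_chunks(dirty_rows, clean_rows, train_ids, chunk_size):
    """Chunk only training rows; a clean chunk keeps an empty repair list."""
    chunks = []
    for start in range(0, len(train_ids), chunk_size):
        ids = train_ids[start:start + chunk_size]
        chunks.append({"rows": ids,
                       "repairs": diff_repairs(dirty_rows, clean_rows, ids)})
    return chunks


# Hypothetical sample data: one dirty cell ("Bostn") in the first row.
DIRTY = "id,city\n1,Bostn\n2,Austin\n3,Denver\n4,Miami\n"
CLEAN = "id,city\n1,Boston\n2,Austin\n3,Denver\n4,Miami\n"

dirty, clean = read_rows(DIRTY), read_rows(CLEAN)
train_ids, heldout_ids = split_row_ids(len(dirty), heldout_fraction=0.25, seed=0)
chunks = build_chunks(dirty, clean, train_ids, chunk_size=2)
```

Because the held-out indices are fixed before `build_chunks` runs, no held-out row can appear in a chunk's rows or repairs. Chunks whose rows already match the clean file end up with an empty repairs list, which is the shape a hard negative takes in this sketch.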

The trajectory JSONL includes state, tool calls, diagnosis text, proposed fixes, teacher/oracle metadata, benchmark metrics, split metadata, and source provenance for auditability.

Training Procedure

  • Base model: Qwen/Qwen2.5-0.5B-Instruct.
  • Method: 4-bit QLoRA warmup, after which the LoRA adapter is merged into the base model and saved as fp16 weights.
  • Compute target: Kaggle or Hugging Face remote GPU only; no laptop model training or full evaluation.
  • Kaggle hours used: 0.794.
  • Epochs: 2.
  • Batch size: 1 per device with gradient accumulation of 16.
  • Learning rate: 2e-5.
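
As a rough sanity check on the settings above, the effective batch size and the approximate number of optimizer steps can be worked out directly. The sketch assumes a single GPU, that all 1226 examples are used for training, and that a final partial accumulation window counts as one step; the actual trainer may round differently.

```python
import math

num_examples = 1226    # training examples reported in this card
per_device_batch = 1   # batch size per device
grad_accum = 16        # gradient accumulation steps
epochs = 2
num_devices = 1        # assumption: single GPU

# Examples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Assumption: the last partial accumulation window still triggers an update.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 77 154
```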

Evaluation

Evaluation is reported on held-out DataForge-Bench-style tasks sampled after the training trajectory seeds. The release status generated by the notebook for this run is diagnostic_complete_no_gain, which means the run is authentic and published but not promoted. Only the quality_improved_verified status should be treated as a quality milestone.

Model                        Held-out macro F1
Qwen/Qwen2.5-0.5B-Instruct   0.002
DataForge-0.5B-SFT           0.0
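
For readers unfamiliar with the metric, the sketch below shows one common way to compute a macro F1 over repair sets. It assumes that repairs are compared as exact (row, column, new value) tuples, that the macro average is an unweighted mean over datasets, and that an empty prediction against an empty gold set scores 1.0. None of these conventions are confirmed by this card; the authoritative definition lives in the publishing notebook. The example scores are illustrative, not results.

```python
def repair_f1(predicted, gold):
    """F1 between predicted and gold repair sets (exact-match tuples)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumption: correctly leaving clean data alone is a full score
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_dataset_f1):
    """Unweighted mean of per-dataset F1 scores."""
    return sum(per_dataset_f1.values()) / len(per_dataset_f1)


# Hypothetical repairs: one of two predictions matches one of two gold repairs.
predicted = [(0, "city", "Boston"), (1, "city", "Austn")]
gold = [(0, "city", "Boston"), (2, "state", "MA")]
example_f1 = repair_f1(predicted, gold)

# Hypothetical per-dataset scores, used only to show the averaging step.
example_macro = macro_f1({"hospital": 0.5, "flights": 0.0, "beers": 1.0})
```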

Release gates:

  • Parse success: 0.94.
  • Schema-case errors: 45.
  • Quality milestone: False.
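
To make the relationship between the reported scores and the release status concrete, here is a deliberately simplified sketch. The rule shown (promote only when the SFT macro F1 strictly exceeds the base macro F1) is an assumption for illustration. The notebook's real gating logic is not documented in this card and may also weigh parse success and schema-case errors.

```python
def release_status(base_f1, sft_f1):
    """Hypothetical gate: promote only if SFT strictly beats the base model."""
    if sft_f1 > base_f1:
        return "quality_improved_verified"
    return "diagnostic_complete_no_gain"


# Held-out macro F1 values reported in this card.
status = release_status(base_f1=0.002, sft_f1=0.0)
quality_milestone = status == "quality_improved_verified"
print(status, quality_milestone)  # diagnostic_complete_no_gain False
```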

These numbers are produced by the publishing notebook and should not be edited manually. Re-run the notebook to regenerate them. Detailed per-dataset metrics are stored in training_metrics.json under base_eval and sft_eval. Bounded per-task failure evidence is stored in eval_diagnostics.json.

Limitations

  • The checkpoint is a Week 9 warmup model, not the final DataForge model family.
  • It has only seen small chunk-level ReAct traces and may fail on larger schemas, unseen domains, adversarial dirty values, or tasks requiring multi-step database access.
  • Legacy teacher traces can contain teacher errors; the primary current labels come from exact dirty/clean diffs.
  • The model should be used behind DataForge's safety, verifier, and transaction layers before any real data changes.

License

Weights are published under the apache-2.0 license after verifying the base model metadata for Qwen/Qwen2.5-0.5B-Instruct. Users must also comply with the source dataset licenses and terms, and with the teacher model terms that governed trajectory generation.
