Instructions to use Praneshrajan15/DataForge-0.5B-SFT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Praneshrajan15/DataForge-0.5B-SFT with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Praneshrajan15/DataForge-0.5B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Praneshrajan15/DataForge-0.5B-SFT")
model = AutoModelForCausalLM.from_pretrained("Praneshrajan15/DataForge-0.5B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Praneshrajan15/DataForge-0.5B-SFT with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Praneshrajan15/DataForge-0.5B-SFT"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Praneshrajan15/DataForge-0.5B-SFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker
```shell
docker model run hf.co/Praneshrajan15/DataForge-0.5B-SFT
```
- SGLang
How to use Praneshrajan15/DataForge-0.5B-SFT with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Praneshrajan15/DataForge-0.5B-SFT" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Praneshrajan15/DataForge-0.5B-SFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Praneshrajan15/DataForge-0.5B-SFT" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Praneshrajan15/DataForge-0.5B-SFT",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Docker Model Runner
How to use Praneshrajan15/DataForge-0.5B-SFT with Docker Model Runner:
```shell
docker model run hf.co/Praneshrajan15/DataForge-0.5B-SFT
```
DataForge-0.5B-SFT
DataForge-0.5B-SFT is a supervised-fine-tuned warmup checkpoint for tabular
data-quality repair experiments. The current training path uses chunk-level
DataForge expert trajectories whose exact repairs are derived from audited
dirty/clean CSV diffs. The earlier v0-smoke release only proved the
Kaggle-to-Hugging-Face pipeline and should not be read as a performance claim.
Intended Use
- Research on tabular data-quality agents and repair planning.
- Offline evaluation on DataForge-Bench-style Hospital, Flights, and Beers tasks.
- Warm-starting later DataForge RL experiments.
This checkpoint is not intended for autonomous production data modification, medical decision support, regulated data governance, or unsupervised repair of private datasets.
Training Data
- Dataset repo: `Praneshrajan15/dataforge-sft-trajectories`.
- Dataset repo SHA used for this run: `1e8612e5ddd48ef2d7ab78592059d187bd67ba3e`.
- Training examples: 1226 chunk-level `expert_v2` JSONL records.
- Data sources: Raha benchmark Hospital, Flights, and Beers datasets via the BigDaMa/raha repository.
- Primary label source: `oracle_from_clean_diff` dirty/clean CSV diffs.
- Legacy teacher lineage: Groq-hosted `clean-diff-v1` ReAct smoke records may remain for auditability, but exact repairs are not teacher-discovered labels.
- Flights schedule and actual-time repairs are supervised from dirty/clean labels; they are not inferred from incomplete prompt context.
- Split safety: held-out rows are reserved before chunking and excluded from SFT target rows, context rows, normalization candidates, fixes, and messages.
- Hard negatives: clean train chunks are retained as `finish` examples with empty repairs so the model is penalized for unnecessary edits.
The trajectory JSONL includes state, tool calls, diagnosis text, proposed fixes, teacher/oracle metadata, benchmark metrics, split metadata, and source provenance for auditability.
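The label derivation, split-safety rule, and hard negatives described above can be sketched in a few lines. This is a minimal illustration, not the DataForge pipeline itself: the function names, the record fields (`rows`, `action`, `repairs`), the `repair` action name, and the toy CSV contents are all assumptions made for the example. Only the `finish` action with empty repairs comes from the card.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def cell_repairs(dirty, clean, row_ids):
    """Emit one repair per cell whose dirty value differs from the clean value."""
    repairs = []
    for r in row_ids:
        for col, dirty_value in dirty[r].items():
            if dirty_value != clean[r][col]:
                repairs.append(
                    {"row": r, "column": col, "old": dirty_value, "new": clean[r][col]}
                )
    return repairs


def build_examples(dirty, clean, held_out, chunk_size):
    """Reserve held-out rows first, then chunk the rest and label each chunk."""
    # Held-out rows are dropped BEFORE chunking so they never appear in any chunk.
    train_ids = [r for r in range(len(dirty)) if r not in held_out]
    examples = []
    for start in range(0, len(train_ids), chunk_size):
        row_ids = train_ids[start:start + chunk_size]
        repairs = cell_repairs(dirty, clean, row_ids)
        # A chunk with no differences becomes a hard negative: finish, no repairs.
        action = "repair" if repairs else "finish"  # "repair" is a hypothetical name
        examples.append({"rows": row_ids, "action": action, "repairs": repairs})
    return examples


# Toy data (hypothetical): row 1 has a typo, row 4 is held out.
DIRTY = "city,state\nBoston,MA\nBostn,MA\nAustin,TX\nAustin,TX\nDenver,C0\n"
CLEAN = "city,state\nBoston,MA\nBoston,MA\nAustin,TX\nAustin,TX\nDenver,CO\n"

examples = build_examples(read_rows(DIRTY), read_rows(CLEAN), held_out={4}, chunk_size=2)
for example in examples:
    print(example)
```

In this toy run the first chunk carries one repair, the second chunk is a clean `finish` example, and the held-out row never enters any chunk.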
Training Procedure
- Base model: `Qwen/Qwen2.5-0.5B-Instruct`.
- Method: 4-bit QLoRA warmup, then LoRA merge into fp16 merged weights.
- Compute target: Kaggle or Hugging Face remote GPU only; no laptop model training or full evaluation.
- Kaggle hours used: 0.794.
- Epochs: 2.
- Batch size: 1 per device with gradient accumulation of 16.
- Learning rate: 2e-5.
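As a rough sanity check on these settings, the effective batch size and an approximate optimizer step count can be worked out directly. The sketch below assumes a single GPU, assumes all 1226 records are used for training, and assumes the last partial accumulation window still triggers an update; none of these is stated in the card, and some trainers round down instead.

```python
import math

NUM_EXAMPLES = 1226      # chunk-level records listed under Training Data
PER_DEVICE_BATCH = 1     # from the card
GRAD_ACCUM_STEPS = 16    # from the card
EPOCHS = 2               # from the card
NUM_DEVICES = 1          # assumption: single GPU

# Examples contributing to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Assumption: a final partial window still counts as one update.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)  # 16 77 154
```

Under those assumptions the run amounts to roughly 154 optimizer updates, which is consistent with the card describing this checkpoint as a warmup.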
Evaluation
Evaluation is reported on held-out DataForge-Bench-style tasks sampled after the
training trajectory seeds. The release status generated by the notebook is
diagnostic_complete_no_gain. Only quality_improved_verified should be treated as a
quality milestone. diagnostic_complete_no_gain means the run is authentic and
published, but not promoted.
| Model | Held-out macro F1 |
|---|---|
| Qwen/Qwen2.5-0.5B-Instruct | 0.002 |
| DataForge-0.5B-SFT | 0.0 |
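For readers unfamiliar with the metric, one plausible way to compute a macro F1 over repair tasks is shown below. This is a sketch under assumptions: it treats each repair as a `(row, column, new_value)` tuple, averages over datasets with equal weight, and scores a task with no predicted and no gold repairs as 1.0. The card does not specify these details, and the toy values are invented for illustration.

```python
def f1_score(predicted, gold):
    """F1 over sets of repairs, e.g. (row, column, new_value) tuples."""
    if not predicted and not gold:
        return 1.0  # assumption: a clean task with no edits counts as perfect
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_dataset):
    """Unweighted mean of per-dataset F1 scores."""
    scores = [f1_score(pred, gold) for pred, gold in per_dataset.values()]
    return sum(scores) / len(scores)


# Toy predictions and gold repairs (hypothetical values).
toy = {
    "hospital": ({(1, "city", "Boston")}, {(1, "city", "Boston"), (2, "zip", "02115")}),
    "flights": (set(), {(0, "sched_dep_time", "7:10 a.m.")}),
    "beers": ({(3, "abv", "0.05")}, {(3, "abv", "0.05")}),
}

score = macro_f1(toy)
print(round(score, 4))
```

Because the mean is unweighted, a dataset where the model proposes no correct repairs pulls the macro score down regardless of how many rows it contains.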
Release gates:
- Parse success: 0.94.
- Schema-case errors: 45.
- Quality milestone: False.
These numbers are produced by the publishing notebook and should not be edited
manually. Re-run the notebook to regenerate them. Detailed per-dataset metrics
are stored in training_metrics.json under base_eval and sft_eval.
Bounded per-task failure evidence is stored in eval_diagnostics.json.
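The relationship between the reported scores and the release status can be illustrated with a small sketch. The rule used here (promote only when the SFT held-out macro F1 exceeds the base model's) is an assumption chosen to match the reported outcome; the publishing notebook may apply further gates such as parse success or schema-case error thresholds. The function name `release_status` is hypothetical.

```python
def release_status(base_f1, sft_f1):
    """Assumed rule: promote only when SFT held-out macro F1 beats the base model."""
    quality_milestone = sft_f1 > base_f1
    status = (
        "quality_improved_verified"
        if quality_milestone
        else "diagnostic_complete_no_gain"
    )
    return status, quality_milestone


# Values taken from the evaluation table above.
status, milestone = release_status(base_f1=0.002, sft_f1=0.0)
print(status, milestone)  # diagnostic_complete_no_gain False
```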
Limitations
- The checkpoint is a Week 9 warmup model, not the final DataForge model family.
- It has only seen small chunk-level ReAct traces and may fail on larger schemas, unseen domains, adversarial dirty values, or tasks requiring multi-step database access.
- Legacy teacher traces can contain teacher errors; the primary current labels come from exact dirty/clean diffs.
- The model should be used behind DataForge's safety, verifier, and transaction layers before any real data changes.
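To make the last point concrete, the sketch below shows one way a caller might gate model-proposed repairs behind a check before committing them. It is not DataForge's safety, verifier, or transaction API; `apply_with_verification`, the repair fields, and the `no_empty_cells` verifier are hypothetical stand-ins.

```python
import copy


def apply_with_verification(table, repairs, verifier):
    """Apply repairs to a copy and return it only if the verifier accepts the result."""
    candidate = copy.deepcopy(table)  # never mutate the caller's data in place
    for repair in repairs:
        row, col = repair["row"], repair["column"]
        # Safety check: refuse edits whose expected old value does not match.
        if candidate[row][col] != repair["old"]:
            return table, False
        candidate[row][col] = repair["new"]
    # Commit only if the whole candidate passes verification; otherwise roll back.
    if not verifier(candidate):
        return table, False
    return candidate, True


def no_empty_cells(rows):
    """Toy verifier: reject any table that contains an empty cell."""
    return all(value != "" for row in rows for value in row.values())


table = [{"city": "Bostn", "state": "MA"}]
good = [{"row": 0, "column": "city", "old": "Bostn", "new": "Boston"}]
bad = [{"row": 0, "column": "city", "old": "Bostn", "new": ""}]

fixed, committed = apply_with_verification(table, good, no_empty_cells)
unchanged, rejected_ok = apply_with_verification(table, bad, no_empty_cells)
print(fixed, committed)
print(unchanged, rejected_ok)
```

Working on a copy means a rejected batch leaves the original table untouched, which is the property a transaction layer is meant to guarantee.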
License
Weights are published under the apache-2.0 license after verifying the base model
metadata for Qwen/Qwen2.5-0.5B-Instruct. Users must also comply with the source dataset
licenses/terms and the teacher model terms that governed trajectory generation.